Machine Learning Project¶

The goal of this project is to compare the performance of two types of model on a given dataset. In particular, we will compare the performance of Logistic Regression (implemented by sklearn.linear_model.LogisticRegression) with that of a Random Forest (implemented by sklearn.ensemble.RandomForestClassifier). We will also optimize both algorithms' hyperparameters and determine which one is best for this dataset.

Libraries¶

Let's import some libraries that will be useful in the project.

In [ ]:
import numpy as np
import pandas as pd

import seaborn as sns
import matplotlib.pyplot as plt

Exploratory Data Analysis¶

We first want to better understand the structure of the data we are working with. This is an important step in our analysis, since we want to deal with potentially problematic aspects of the data and clean it. Moreover, by understanding its properties, such as the balance between classes, we can work with classifiers correctly and choose the right metrics to evaluate their performance.

Cleaning the dataframe¶
In [ ]:
#import the dataset
df = pd.read_csv("./data/mldata_0003164501.csv", sep = ",")
df
Out[ ]:
Unnamed: 0 label feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 ... feature_22 feature_23 feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 categorical_feature_1
0 0 3 0.977711 0.506776 0.274184 3.146263 0.970143 -0.331920 -1.369330 -0.184021 ... -1.167407 0.740888 -2.823582 1.937094 -1.095149 0.707887 -1.530358 -0.044247 -2.446235 B
1 1 1 0.047318 -4.237021 0.232601 -0.634119 -1.314572 1.993623 1.330799 1.445451 ... -0.857609 0.153576 0.582376 0.808039 0.048286 0.107600 -0.083177 -0.093708 -2.069849 A
2 2 2 1.538534 -1.704886 -0.926433 0.081837 -2.230265 0.286980 -1.556409 0.661266 ... 0.516624 -1.141832 -2.591078 1.304137 -0.385021 1.064548 -0.192902 -1.009055 -0.084599 B
3 3 2 0.677418 4.760305 3.778898 2.630025 -2.763774 -0.097002 4.955849 -0.854142 ... -0.852140 1.343323 3.636170 0.677988 0.212772 -0.677034 0.639229 -0.675410 1.036157 B
4 4 1 1.692027 0.076915 -1.736087 1.497523 -0.405362 -0.184517 -0.707944 0.550167 ... 0.680531 1.500405 -0.675614 0.234405 0.736274 0.146277 1.758071 1.209597 -0.707836 A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1295 1295 0 -1.106235 0.192021 -3.044139 -0.413591 1.622252 -0.254356 -0.101595 -0.745178 ... -1.303725 0.181726 2.327401 2.433295 0.574041 -3.224432 -0.788689 0.533658 -3.579361 C
1296 1296 2 -0.546425 -0.658874 -1.400140 -3.119142 -2.378495 -0.324184 1.249406 1.060721 ... 0.465890 -1.498137 1.581889 -1.957156 -0.931723 0.218409 -0.431815 -1.814201 2.575537 B
1297 1297 3 0.866665 2.464064 3.617311 2.733195 0.794086 -0.541519 -3.339625 2.101668 ... 0.553514 -0.616828 -4.677591 -1.403826 0.049761 0.119145 0.375289 0.681518 0.695855 B
1298 1298 0 -0.326577 1.397554 -1.358974 -4.921358 -0.640626 -0.519134 2.351460 -0.986060 ... -1.525018 -0.400171 2.559972 -3.349548 0.627411 -0.287641 -0.561310 0.642402 -0.071863 C
1299 1299 1 -0.486092 2.158804 3.344615 3.395124 0.099400 0.021686 1.859910 -0.916593 ... -2.267329 -0.665686 -0.163075 -4.088241 -1.019011 1.508333 -0.985281 1.070124 -1.429840 A

1300 rows × 33 columns

In [ ]:
#shape of the dataset
df.shape
Out[ ]:
(1300, 33)

We are given a dataset whose first column is the sample id and whose second column is the label, followed by 31 features, of which the last one is categorical. We have 1300 samples. First of all, we want to set the Unnamed: 0 column as the index of the dataframe.

In [ ]:
#set the unnamed column as the index
df.set_index('Unnamed: 0', drop = True, inplace = True)
df
Out[ ]:
label feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_22 feature_23 feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 categorical_feature_1
Unnamed: 0
0 3 0.977711 0.506776 0.274184 3.146263 0.970143 -0.331920 -1.369330 -0.184021 -0.184244 ... -1.167407 0.740888 -2.823582 1.937094 -1.095149 0.707887 -1.530358 -0.044247 -2.446235 B
1 1 0.047318 -4.237021 0.232601 -0.634119 -1.314572 1.993623 1.330799 1.445451 -0.086046 ... -0.857609 0.153576 0.582376 0.808039 0.048286 0.107600 -0.083177 -0.093708 -2.069849 A
2 2 1.538534 -1.704886 -0.926433 0.081837 -2.230265 0.286980 -1.556409 0.661266 2.436296 ... 0.516624 -1.141832 -2.591078 1.304137 -0.385021 1.064548 -0.192902 -1.009055 -0.084599 B
3 2 0.677418 4.760305 3.778898 2.630025 -2.763774 -0.097002 4.955849 -0.854142 -1.733114 ... -0.852140 1.343323 3.636170 0.677988 0.212772 -0.677034 0.639229 -0.675410 1.036157 B
4 1 1.692027 0.076915 -1.736087 1.497523 -0.405362 -0.184517 -0.707944 0.550167 -0.027586 ... 0.680531 1.500405 -0.675614 0.234405 0.736274 0.146277 1.758071 1.209597 -0.707836 A
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1295 0 -1.106235 0.192021 -3.044139 -0.413591 1.622252 -0.254356 -0.101595 -0.745178 -2.724777 ... -1.303725 0.181726 2.327401 2.433295 0.574041 -3.224432 -0.788689 0.533658 -3.579361 C
1296 2 -0.546425 -0.658874 -1.400140 -3.119142 -2.378495 -0.324184 1.249406 1.060721 -0.587969 ... 0.465890 -1.498137 1.581889 -1.957156 -0.931723 0.218409 -0.431815 -1.814201 2.575537 B
1297 3 0.866665 2.464064 3.617311 2.733195 0.794086 -0.541519 -3.339625 2.101668 0.684216 ... 0.553514 -0.616828 -4.677591 -1.403826 0.049761 0.119145 0.375289 0.681518 0.695855 B
1298 0 -0.326577 1.397554 -1.358974 -4.921358 -0.640626 -0.519134 2.351460 -0.986060 -0.976389 ... -1.525018 -0.400171 2.559972 -3.349548 0.627411 -0.287641 -0.561310 0.642402 -0.071863 C
1299 1 -0.486092 2.158804 3.344615 3.395124 0.099400 0.021686 1.859910 -0.916593 -1.786863 ... -2.267329 -0.665686 -0.163075 -4.088241 -1.019011 1.508333 -0.985281 1.070124 -1.429840 A

1300 rows × 32 columns

Now let's deal with the categorical data. In particular, the last feature is a categorical variable, so we will use a one-hot encoding. We chose this approach over an integer encoding because it makes no assumption about an ordering between the categories. Moreover, since the feature takes only 3 values, one-hot encoding will not increase the dimensionality of the dataframe by much.

In [ ]:
#apply one hot encoding to the categorical variable
from sklearn.preprocessing import OneHotEncoder

encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(df[['categorical_feature_1']]).toarray())
encoder_df.columns = ['A', 'B', 'C']

df = df.join(encoder_df)
df.drop('categorical_feature_1', axis = 1, inplace = True)
df
Out[ ]:
label feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 A B C
Unnamed: 0
0 3 0.977711 0.506776 0.274184 3.146263 0.970143 -0.331920 -1.369330 -0.184021 -0.184244 ... -2.823582 1.937094 -1.095149 0.707887 -1.530358 -0.044247 -2.446235 0.0 1.0 0.0
1 1 0.047318 -4.237021 0.232601 -0.634119 -1.314572 1.993623 1.330799 1.445451 -0.086046 ... 0.582376 0.808039 0.048286 0.107600 -0.083177 -0.093708 -2.069849 1.0 0.0 0.0
2 2 1.538534 -1.704886 -0.926433 0.081837 -2.230265 0.286980 -1.556409 0.661266 2.436296 ... -2.591078 1.304137 -0.385021 1.064548 -0.192902 -1.009055 -0.084599 0.0 1.0 0.0
3 2 0.677418 4.760305 3.778898 2.630025 -2.763774 -0.097002 4.955849 -0.854142 -1.733114 ... 3.636170 0.677988 0.212772 -0.677034 0.639229 -0.675410 1.036157 0.0 1.0 0.0
4 1 1.692027 0.076915 -1.736087 1.497523 -0.405362 -0.184517 -0.707944 0.550167 -0.027586 ... -0.675614 0.234405 0.736274 0.146277 1.758071 1.209597 -0.707836 1.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1295 0 -1.106235 0.192021 -3.044139 -0.413591 1.622252 -0.254356 -0.101595 -0.745178 -2.724777 ... 2.327401 2.433295 0.574041 -3.224432 -0.788689 0.533658 -3.579361 0.0 0.0 1.0
1296 2 -0.546425 -0.658874 -1.400140 -3.119142 -2.378495 -0.324184 1.249406 1.060721 -0.587969 ... 1.581889 -1.957156 -0.931723 0.218409 -0.431815 -1.814201 2.575537 0.0 1.0 0.0
1297 3 0.866665 2.464064 3.617311 2.733195 0.794086 -0.541519 -3.339625 2.101668 0.684216 ... -4.677591 -1.403826 0.049761 0.119145 0.375289 0.681518 0.695855 0.0 1.0 0.0
1298 0 -0.326577 1.397554 -1.358974 -4.921358 -0.640626 -0.519134 2.351460 -0.986060 -0.976389 ... 2.559972 -3.349548 0.627411 -0.287641 -0.561310 0.642402 -0.071863 0.0 0.0 1.0
1299 1 -0.486092 2.158804 3.344615 3.395124 0.099400 0.021686 1.859910 -0.916593 -1.786863 ... -0.163075 -4.088241 -1.019011 1.508333 -0.985281 1.070124 -1.429840 1.0 0.0 0.0

1300 rows × 34 columns

Now that we have finished cleaning the dataset, we want to create a series containing the labels for each sample, and a dataframe containing just the features. This will be useful later on.

In [ ]:
#labels and features dataframe
df_label = df['label']
df = df.iloc[:, 1:]
df_T = df.transpose()
Analysis of the dataframe¶

Now that we have fixed the structure of the data, we can start analysing its properties in more depth.

In [ ]:
#information about the dataframe
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1300 entries, 0 to 1299
Data columns (total 33 columns):
 #   Column      Non-Null Count  Dtype  
---  ------      --------------  -----  
 0   feature_1   1300 non-null   float64
 1   feature_2   1300 non-null   float64
 2   feature_3   1300 non-null   float64
 3   feature_4   1300 non-null   float64
 4   feature_5   1300 non-null   float64
 5   feature_6   1300 non-null   float64
 6   feature_7   1300 non-null   float64
 7   feature_8   1300 non-null   float64
 8   feature_9   1300 non-null   float64
 9   feature_10  1300 non-null   float64
 10  feature_11  1300 non-null   float64
 11  feature_12  1300 non-null   float64
 12  feature_13  1300 non-null   float64
 13  feature_14  1300 non-null   float64
 14  feature_15  1300 non-null   float64
 15  feature_16  1300 non-null   float64
 16  feature_17  1300 non-null   float64
 17  feature_18  1300 non-null   float64
 18  feature_19  1300 non-null   float64
 19  feature_20  1300 non-null   float64
 20  feature_21  1300 non-null   float64
 21  feature_22  1300 non-null   float64
 22  feature_23  1300 non-null   float64
 23  feature_24  1300 non-null   float64
 24  feature_25  1300 non-null   float64
 25  feature_26  1300 non-null   float64
 26  feature_27  1300 non-null   float64
 27  feature_28  1300 non-null   float64
 28  feature_29  1300 non-null   float64
 29  feature_30  1300 non-null   float64
 30  A           1300 non-null   float64
 31  B           1300 non-null   float64
 32  C           1300 non-null   float64
dtypes: float64(33)
memory usage: 377.6 KB
In [ ]:
#types contained in the dataframe
df.dtypes.unique()
Out[ ]:
array([dtype('float64')], dtype=object)
In [ ]:
#number of null entries
print("Number of NaN entries: ", df.isna().sum().sum())
Number of NaN entries:  0

Notice that the feature dataframe contains only numerical features, and that there are no missing values.

In [ ]:
#unique label types
print("Unique label types: ", df_label.unique())
print(df_label.value_counts())
Unique label types:  [3 1 2 0]
3    343
0    334
2    314
1    309
Name: label, dtype: int64

Notice that we are dealing with a more-or-less balanced dataset. We can also check for duplicate rows / columns:

In [ ]:
#find duplicated rows and columns
df_duplicate_rows = df[df.duplicated(keep=False)]
print("Number of duplicate rows: ", df_duplicate_rows.shape[0])

df_duplicate_cols = df_T[df_T.duplicated(keep=False)]
print("Number of duplicate cols: ", df_duplicate_cols.shape[0])
Number of duplicate rows:  0
Number of duplicate cols:  0
Distribution of the dataframe¶

Now, let's check whether the data follows a normal distribution, and then visualize the various samples. We are first going to analyse the skewness and the kurtosis of the dataframe. These measures tell us, respectively, how symmetric the data is and how heavy its tails are compared to a normal distribution.

In [ ]:
#skewness and kurtosis of the whole dataframe
from scipy.stats import kurtosis, skew

df_numerical = df_T.loc['feature_1':'feature_30', :].astype(float)  #numerical features only (iloc[1:31] would skip feature_1 and include the one-hot column 'A')

df_flattened = pd.Series(df_numerical.values.flatten())
skew_df = skew(df_flattened)
kurtosis_df = kurtosis(df_flattened)
print("Skewness of the whole dataframe: ", round(skew_df))
print("Kurtosis of the whole dataframe: ", round(kurtosis_df))
Skewness of the whole dataframe:  0
Kurtosis of the whole dataframe:  2
In [ ]:
#skewness for each single sample
df_colnames = list(df_numerical.columns)
colN = df_numerical.shape[1]
df_skew = []

for i in range(colN):
    v_df = df_T.loc['feature_1':'feature_30', df_colnames[i]].astype(float)
    df_skew += [skew(v_df)] 

sns.histplot(df_skew, bins=100, kde=True)
plt.show()
In [ ]:
#kurtosis for each single sample
df_kurtosis = []

for i in range(colN):
    v_df = df_T.loc[ 'feature_1':'feature_30', df_colnames[i]].astype(float)
    df_kurtosis += [kurtosis(v_df)]

sns.histplot(df_kurtosis, bins=100, kde=True)
plt.show()

In our case, we see low values of both skewness and kurtosis, which suggests the data is close to a normal distribution. In particular, low skewness indicates that the data is approximately symmetric, and low kurtosis means that the distribution's tails behave roughly like those of a normal distribution (note that scipy's kurtosis returns the excess kurtosis, which is 0 for a normal distribution).
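As a point of reference (an illustrative sketch on simulated data, not the project dataset): scipy's skew and kurtosis both return values close to 0 for a normal sample, since kurtosis computes the excess kurtosis by default, while a heavy-tailed distribution gives a clearly positive excess kurtosis.

```python
import numpy as np
from scipy.stats import kurtosis, skew

rng = np.random.default_rng(0)

# Baseline: a large normal sample has skewness ~ 0 and excess kurtosis ~ 0
normal_sample = rng.standard_normal(100_000)
print(round(skew(normal_sample), 2), round(kurtosis(normal_sample), 2))

# Contrast: Student's t with 5 degrees of freedom is heavy-tailed,
# so its excess kurtosis is clearly positive (6 in theory)
t_sample = rng.standard_t(df=5, size=100_000)
print(round(kurtosis(t_sample), 2))
```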

Now let's visualize the distribution of the samples.

In [ ]:
#distribution of the samples
fig, ax = plt.subplots(1,1, figsize = (5,5))
for i in range(0,5):
    col = df_T[i].loc['feature_1':'feature_30'].astype(float)
    sns.kdeplot(col);
plt.show()
In [ ]:
#obtaining random samples
random_columns = df_T.columns.tolist()
np.random.shuffle(random_columns)
In [ ]:
# boxplot and violin plot of 50 samples
col = df_T.loc['feature_1':'feature_30', random_columns[0:50]].astype(float)

fig, ax= plt.subplots(2, figsize=(10, 10))
plt.suptitle("Box/violin plot of 50 random samples ")
ax[0].boxplot(col)
ax[1].violinplot(col)
ax[0].set_xlabel("Boxplot")
ax[1].set_xlabel("Violin plot")
ax[0].get_xaxis().set_visible(False)
ax[1].get_xaxis().set_visible(False)

plt.show()

Notice that, graphically too, the data seems to resemble a normal distribution.

Now let's analyse the distribution of the features:

In [ ]:
#description of dataframe
df.describe()
Out[ ]:
feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 feature_10 ... feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 A B C
count 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 ... 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000 1300.000000
mean 0.057059 -0.358120 -0.353362 0.248850 -0.548936 0.113459 -0.044129 0.106316 0.130741 0.050906 ... 0.130333 -0.388395 0.061304 0.137229 0.086137 0.083467 -0.200111 0.346154 0.360769 0.293077
std 1.016393 2.331683 2.327112 2.276689 2.238338 0.997191 2.367311 0.999200 0.995196 0.963253 ... 2.358335 2.388649 0.998118 1.015407 1.033255 1.004888 2.290431 0.475926 0.480408 0.455349
min -3.407686 -6.866555 -8.196498 -7.137449 -6.330767 -3.207747 -6.896524 -3.064150 -2.930279 -3.645938 ... -7.352388 -8.961599 -3.368877 -3.324403 -4.199878 -3.491512 -9.753380 0.000000 0.000000 0.000000
25% -0.628012 -1.978855 -1.939604 -1.274770 -2.134220 -0.574609 -1.742431 -0.601565 -0.522857 -0.573218 ... -1.513865 -1.936061 -0.615284 -0.571044 -0.620587 -0.572067 -1.744427 0.000000 0.000000 0.000000
50% 0.046792 -0.436259 -0.388752 0.300348 -0.617970 0.144132 -0.059742 0.103381 0.135093 0.095434 ... 0.070840 -0.472139 0.061542 0.118818 0.026834 0.139781 -0.257736 0.000000 0.000000 0.000000
75% 0.756940 1.193450 1.283804 1.830907 0.809936 0.808414 1.587650 0.816937 0.790001 0.682353 ... 1.699240 1.301831 0.751273 0.840807 0.846782 0.748021 1.341203 1.000000 1.000000 1.000000
max 3.086628 6.955199 9.391982 8.481066 8.694150 3.307266 7.817989 3.569670 3.331400 2.914597 ... 9.359668 7.756389 3.098626 3.138477 3.144868 3.224480 9.163436 1.000000 1.000000 1.000000

8 rows × 33 columns

In [ ]:
# boxplot and violin plot of the features
col = df.loc[:, 'feature_1':'feature_30'].astype(float)

fig, ax= plt.subplots(2, figsize=(10, 10))
plt.suptitle("Box/violin plot of the features")
ax[0].boxplot(col)
ax[1].violinplot(col)
ax[0].set_xlabel("Boxplot")
ax[1].set_xlabel("Violin plot")
ax[0].get_xaxis().set_visible(False)
ax[1].get_xaxis().set_visible(False)

plt.show()
In [ ]:
#histogram of the features
df.iloc[:, :30].hist(bins=20, figsize=(20,15))
plt.show()

The distribution seems to be normal. We have decided to retain the few outliers, since the data still appears to follow a normal distribution. Now we can also analyse the correlation between features, in order to assess whether there is some relation among them. We are going to use the Pearson correlation coefficient, which measures the linear correlation between pairs of features.

In [ ]:
#correlation matrix of the features
corr_matrix_p = df.corr(method = 'pearson')

plt.figure(figsize=(10,5))
sns.heatmap(corr_matrix_p, cmap='coolwarm', yticklabels = False, xticklabels = False)
plt.xlabel('Feature')
plt.ylabel('Feature')
plt.show()
In [ ]:
#scatter matrix of the features
from pandas.plotting import scatter_matrix

attributes = ["feature_1", "feature_2", "feature_3", "feature_4", "feature_5"]
scatter_matrix(df[attributes], figsize=(12, 8));

The features seem to be uncorrelated with one another.

Visualization of the dataframe¶

Now, let's visualize our data a bit, to see how the various classes are distributed in space. We are going to use PCA in order to visualize our data in a lower dimensional space. Recall that PCA is a linear dimensionality reduction technique.

In [ ]:
#apply PCA to our data
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

sc = StandardScaler()
 
sc.fit(df)
sc_data = sc.transform(df)

principal = PCA()
principal.fit(sc_data)
x = principal.transform(sc_data)
In [ ]:
#plot our data in a 2D and 3D space
from mpl_toolkits.mplot3d import Axes3D

cmap = {0: 'red', 1: 'blue', 2: 'purple', 3: 'green'}
colors = [cmap[lab] for lab in df_label]

fig = plt.figure(figsize=(12, 6))
ax = [None, None] 

ax[0] = fig.add_subplot(121)
ax[0].scatter(x[:,0], x[:,1], c=colors)
ax[0].set_xlabel('PC1')
ax[0].set_ylabel('PC2')
ax[0].grid(True)
handles = [plt.plot([], [], marker="o", ms=10, ls="", mec=None, color=val, label=key)[0] for key, val in cmap.items()]
ax[0].legend(handles=handles, numpoints=1, loc='lower right')

ax[1] = fig.add_subplot(122, projection='3d')
ax[1].scatter(x[:,0], x[:,1], x[:,2], c = colors)
ax[1].set_xlabel("PC1", fontsize=10)
ax[1].set_ylabel("PC2", fontsize=10)
ax[1].set_zlabel("PC3", fontsize=10)

plt.tight_layout()
plt.show()

Notice that both in 2D and in 3D the data is not linearly separable. However, in 3D there seems to be some sort of separation, at least for classes 0 and 1. So we might intuitively expect class 0 to be classified best, followed by class 1, while classes 2 and 3 will be classified worst. Moreover, due to their proximity, we expect those two classes to be misclassified mainly between themselves.

Therefore, overall we don't expect great performance from our classifiers, especially from logistic regression. This is because it has a linear decision boundary, and therefore it cannot classify non-linearly separable data well. A random forest, instead, is a model capable of learning essentially any decision boundary if we give it no constraints. Therefore, we expect that, when initially training the model, it will be able to classify the training set perfectly. However, we don't expect the algorithm to generalize that well to new data, since the classes don't seem to be well separated. To deal with data of this type, which is not well separated, it may be better to use distance-based classifiers.
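To make the last remark concrete, here is a minimal sketch of a distance-based classifier, k-nearest neighbours, which labels a point by majority vote among its k closest training points. The synthetic data below is a hypothetical stand-in with the same shape as our dataset, not the actual project data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 4-class data with the same shape as ours (1300 x 33)
X, y = make_classification(n_samples=1300, n_features=33, n_informative=20,
                           n_classes=4, random_state=42)

# Each prediction is a majority vote among the 15 nearest training points
knn = KNeighborsClassifier(n_neighbors=15)
scores = cross_val_score(knn, X, y, cv=5)
print(scores.mean().round(3))
```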

In [ ]:
#cumulative variance explained 
cumsum = np.cumsum(principal.explained_variance_ratio_)
d = np.argmax(cumsum >= 0.95) + 1
plt.figure(figsize=(6, 4))
plt.plot(cumsum, linewidth=3)
plt.xlabel("Dimensions")
plt.ylabel("Explained Variance")
plt.grid(True)
plt.show()
In [ ]:
#explained variance for each component
plt.plot(principal.explained_variance_ratio_)
plt.ylabel('Explained Variance')
plt.xlabel('Components')
plt.grid(True)
plt.show()

Notice that all of the features seem to convey information. In fact, from the cumulative explained variance, we see that many components are needed to reach a high explained variance.
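The number of components needed for a given variance threshold can also be extracted programmatically (the cell above already computes `d` this way without printing it). A self-contained sketch on synthetic uncorrelated data, which behaves like our features appear to:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in: 33 uncorrelated standardized features
rng = np.random.default_rng(42)
X = rng.standard_normal((1300, 33))

pca = PCA().fit(StandardScaler().fit_transform(X))
cumsum = np.cumsum(pca.explained_variance_ratio_)
d = int(np.argmax(cumsum >= 0.95) + 1)
print(f"{d} of 33 components needed to explain 95% of the variance")
```

sklearn can also do this in one step: passing a float, as in PCA(n_components=0.95), keeps exactly enough components to explain that fraction of the variance. When no feature is redundant, almost all components are needed, which is what we observe.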

Analysis¶

In [ ]:
#import libraries
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score, f1_score, classification_report, matthews_corrcoef, roc_auc_score, ConfusionMatrixDisplay
from sklearn import metrics 

import warnings
warnings.filterwarnings('ignore')

Now we want to actually start analysing the two classifiers. Our approach will be to first apply each classifier and perform hyperparameter tuning. Then, for each classifier, we want to check whether modifying the features leads to changes in performance (feature selection).

To start, the first thing we want to do is split the data into a training and a test set, in order to estimate the generalization error. This is called the hold-out method and it is the simplest approach. In particular, we are going to train the models on the training set, find the best hyperparameters there, and then evaluate their performance on the test set. While this is the simplest approach, it comes with a risk: we may end up choosing a model that fits well only the test set, and not general data. This happens when the test set does not represent the population. In our case, however, since the test set has 260 entries and we stratified the split, we can expect it to represent the whole population reasonably well.

In [ ]:
#splitting the dataset
X_train, X_test, y_train, y_test= train_test_split(df, df_label, stratify = df_label, test_size = 0.20, random_state = 42)
In [ ]:
print("Training set size: ", X_train.shape)
print("Test set size: ", X_test.shape)
Training set size:  (1040, 33)
Test set size:  (260, 33)

Notice that here we are doing stratified sampling, that is, we are trying to get a test sample that is representative of the whole population. This allows us not to introduce bias when performing the split.
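A quick sanity check that stratification works (a sketch using synthetic labels that reproduce our class counts, not the project dataframe): the per-class proportions in the train and test splits should agree up to integer rounding.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical labels reproducing our class counts (334/309/314/343)
labels = pd.Series([0] * 334 + [1] * 309 + [2] * 314 + [3] * 343)
features = pd.DataFrame({'feature_1': range(len(labels))})

X_tr, X_te, y_tr, y_te = train_test_split(features, labels, stratify=labels,
                                          test_size=0.20, random_state=42)

# With stratify, per-class frequencies agree up to integer rounding
diff = (y_tr.value_counts(normalize=True) - y_te.value_counts(normalize=True)).abs()
print(diff.max() < 0.01)
```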

Random Forest¶

Let's start with the Random Forest algorithm. Recall that Random Forest is a supervised machine learning model for classification. It is a form of ensemble learning that trains many decision trees and then takes a majority vote over their predictions.
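The mechanism can be sketched by hand (illustrative only, on hypothetical toy data; this is essentially what RandomForestClassifier automates, minus details such as out-of-bag scoring): train each tree on a bootstrap resample, restricting the features considered at each split, then take the most common prediction.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data, not the project dataset
X, y = make_classification(n_samples=500, n_features=10, n_informative=6,
                           n_classes=4, random_state=0)

rng = np.random.default_rng(0)
trees = []
for _ in range(25):
    idx = rng.integers(0, len(X), size=len(X))        # bootstrap resample
    tree = DecisionTreeClassifier(max_features='sqrt', random_state=0)
    trees.append(tree.fit(X[idx], y[idx]))

# Majority vote: most frequent class across the 25 trees, per sample
all_preds = np.array([t.predict(X) for t in trees])
vote = np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, all_preds)
print((vote == y).mean())  # training accuracy of the ensemble
```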

Applying the model¶

First of all, we apply the standard Random Forest algorithm, to check how it performs.

In [ ]:
#training the classifier
rf_classifier = RandomForestClassifier(n_estimators = 100, random_state = 42) 
rf_classifier.fit(X_train, y_train)
y_pred = rf_classifier.predict(X_test) 

Here we have used n_estimators=100 decision trees to perform the majority vote. After training the model on the training set and predicting the test labels, we want to check the performance of this model.

To do so, we are going to use various metrics that provide a detailed evaluation of model performance at the class level. We start with the confusion matrix, which is a nice way to visualize how our model classifies the different classes. In sklearn's convention, each row of the matrix represents the instances of an actual class, while each column represents the instances of a predicted class.

In [ ]:
#creating the confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(confusion_matrix = cm).plot();

It seems that the classes for which the model behaves best are 0 and 1, while classes 2 and 3 are classified poorly and are mostly confused with each other. This is exactly what we expected after visualizing the data through PCA! Moreover, we also see that the classifier separates class 1 quite well from classes 2 and 3. Now we need some statistics to read the matrix more precisely.

The evaluation metrics that we can use are the following:

  • Accuracy: It measures the proportion of correct predictions made by the model out of the total number of predictions, that is $\frac{\text{TP + TN}}{\text{TP+TN+FP+FN}}$. This metric can be particularly misleading on imbalanced datasets.
  • Precision: It measures the proportion of true positive predictions among all the instances predicted as positive by the model, that is $\frac{\text{TP}}{\text{TP + FP}}$. Therefore, a higher precision indicates that the model has a lower rate of incorrectly predicting negative instances as positive.
  • Recall: It measures the proportion of true positive predictions out of all actual positive instances in the dataset, that is $\frac{\text{TP}}{\text{TP + FN}}$. So it indicates the ability of a classifier to find all correct instances per class.
  • F1 score: It is a weighted harmonic mean of precision and recall, normalized between 0 and 1, that is $\frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision + recall}}$. Therefore, it balances both precision and recall, and it incorporates the ability of the model both to avoid false positives and to capture all actual positives. A high F1 score is useful where both high recall and precision are important.
  • ROC-AUC score (Receiver Operating Characteristic - Area Under the Curve): It measures the model's ability to discriminate between positive and negative instances by plotting the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds. In particular, the ROC curve is created by varying the classification threshold of the model and calculating the TPR and FPR at each threshold. The AUC score represents the area under this curve, which ranges from 0 to 1. So, this measure tells how much the model is capable of distinguishing between classes.

Since here we are dealing with multiclass classification, some of these measures, such as precision and recall, cannot be computed directly. What is done instead is to compute the statistic for each class, and then aggregate the per-class values into a single number. They can be aggregated using:

  • macro-average: It computes the metric independently for each class and then takes the unweighted mean, so every class counts equally regardless of its size.
  • micro-average: It computes the metric from the aggregate contributions of all classes, so the result is weighted by the frequency of each class.

We go for the micro-average since we have a balanced dataset, and since this gives us an understandable metric for the overall performance regardless of the class. Note, however, that in single-label multiclass classification the micro-averaged Precision, Recall and F1 are all equal (and equal to the accuracy). We are still going to report all of them.
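The equality is easy to verify on a toy example (hypothetical labels, for illustration only): in single-label multiclass problems every error is simultaneously one false positive and one false negative, so micro-averaged precision, recall and F1 all collapse to the accuracy.

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Hypothetical multiclass labels, for illustration only
y_true = [0, 1, 2, 3, 0, 1, 2, 3, 0, 1]
y_pred = [0, 1, 2, 0, 0, 2, 2, 3, 1, 1]

acc = accuracy_score(y_true, y_pred)          # 7 of 10 correct -> 0.7
micro = [m(y_true, y_pred, average='micro')
         for m in (precision_score, recall_score, f1_score)]
print(acc, micro)  # all four values equal 0.7 (up to float rounding)
```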

We could use also other measures, such as:

  • MCC (Matthews Correlation Coefficient): It computes the correlation coefficient between the observed and predicted classifications. Therefore, +1 indicates a perfect prediction, 0 indicates random prediction, and -1 indicates total disagreement between the predicted and actual labels.
In [ ]:
#giving some metrics 
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, average = 'micro')
recall = recall_score(y_test, y_pred, average = 'micro')
f1 = metrics.f1_score(y_test, y_pred, average = 'micro')

print("Accuracy:", accuracy.round(5))
print("Precision:", precision.round(5))
print("Recall:", recall.round(5))
print("F1 score: ", f1.round(5))
Accuracy: 0.77308
Precision: 0.77308
Recall: 0.77308
F1 score:  0.77308
In [ ]:
#giving the ROC-AUC score, using macro in this case
def roc_auc_score_multiclass(actual_class, pred_class, average = "macro"):
  unique_class = set(actual_class)
  roc_auc_dict = {}
  for per_class in unique_class:
    other_class = [x for x in unique_class if x != per_class]

    new_actual_class = [0 if x in other_class else 1 for x in actual_class]
    new_pred_class = [0 if x in other_class else 1 for x in pred_class]

    roc_auc = roc_auc_score(new_actual_class, new_pred_class, average = average)
    roc_auc_dict[per_class] = roc_auc

  return roc_auc_dict

lr_roc_auc_multiclass = roc_auc_score_multiclass(y_test, y_pred)
print(lr_roc_auc_multiclass)

temporary = []
for i in lr_roc_auc_multiclass.values():
  temporary.append(i)
print("ROC-AUC Score: ", sum(temporary) / len(temporary))
  
{0: 0.92672647127059, 1: 0.8955685891169763, 2: 0.7714527435339618, 3: 0.8017769607843138}
ROC-AUC Score:  0.8488811911764604
In [ ]:
#giving the matthews corr coefficient
mc = matthews_corrcoef(y_test,y_pred)
print('Matthews correlation coefficient:', mc.round(5))
Matthews correlation coefficient: 0.69964

Let's join each of these metrics into a single function:

In [ ]:
#create a performance metric function
def performance_metric(y_t, y_p):
  print("-- PERFORMANCE OF THE MODEL -- ")
  cm = confusion_matrix(y_t, y_p)
  disp = ConfusionMatrixDisplay(confusion_matrix = cm);

  accuracy = accuracy_score(y_t, y_p)
  precision = precision_score(y_t, y_p, average = 'micro')
  recall = recall_score(y_t, y_p, average = 'micro')
  f1 = f1_score(y_t, y_p, average = 'micro')
  
  print("General statistics:")
  print(" - Accuracy:", accuracy.round(5))
  print(" - Precision:", precision.round(5))
  print(" - Recall:", recall.round(5))
  print(" - F1 score: ", f1.round(5))

  roc_auc_multiclass = roc_auc_score_multiclass(y_t, y_p)

  #average the per-class one-vs-rest AUC scores
  roc_auc_avg = sum(roc_auc_multiclass.values()) / len(roc_auc_multiclass)
  print(" - AUC Score: ", round(roc_auc_avg, 5))

  mc = matthews_corrcoef(y_t, y_p)
  print(" - Matthews corrcoef: ", round(mc, 5))

  print(classification_report(y_t, y_p))

  print("\nConfusion matrix:")
  disp.plot()
  plt.grid(False)
  plt.show()

  #avoid shadowing the imported names roc_auc_score / metrics
  results = pd.Series({'accuracy': accuracy, 'precision': precision, 'recall': recall, 'f1': f1, 'roc_auc': roc_auc_avg, 'matthews': mc, 'confusion': disp})
  return results
Hyperparameter selection¶

Now we want to fine-tune the Random Forest, to try to improve the accuracy of the model. In particular, we are going to perform hyperparameter selection, to understand which parameters work best for our data.

Due to computational and time constraints, the approach we will follow is this. We first apply Randomized Search Cross Validation, considering a wide range of values for the parameters. This class tries out a given number of random combinations, selecting a random value for each hyperparameter at every iteration. It is usually faster than Grid Search; however, it doesn't necessarily find the optimum. Moreover, taking only the single best combination it finds may not give the best insight into which parameters to use, since many possible combinations were never tried.

Therefore, we will take the parameter values that performed best and run a Grid Search around them, in which all possible combinations of hyperparameter values are tried using cross-validation. Using cross-validation in this step is important since, otherwise, there would be a risk of overfitting on the training set. When performing k-fold cross-validation for hyperparameter selection, the performance measure is the average of the values computed for each split. This allows us not to lose too much data in the process.

We chose this approach because Grid Search is computationally intensive: while it is the most exhaustive way to perform hyperparameter tuning, it only becomes affordable after we have narrowed the parameter space. By doing this, we explore relatively few combinations exhaustively, after shrinking the search space with the cheaper randomized pass.

In [ ]:
#definition of parameters to try for random search
from sklearn.model_selection import RandomizedSearchCV

n_estimators = [int(x) for x in np.linspace(start = 100, stop = 1000, num = 10)]
criterion = ['gini', 'entropy']
max_depth = [int(x) for x in np.linspace(10, 220, num = 8)]
min_samples_split = [2, 5, 10, 15, 20]
min_samples_leaf = [1, 2, 4, 8, 12]
max_features = [5, 10, 20, 'sqrt']
min_impurity_decrease = [0, 0.1, 0.2, 0.5, 0.8]
bootstrap = [True, False]

param_grid = {
    'n_estimators': n_estimators,
    'criterion': criterion,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'max_features': max_features,
    'min_impurity_decrease': min_impurity_decrease,
    'bootstrap': bootstrap
}
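To get a sense of why exhaustive search over this space is impractical, we can count the full grid (reusing the same lists as above):

```python
import numpy as np
from math import prod

# same parameter lists as defined above
param_grid = {
    'n_estimators': [int(x) for x in np.linspace(100, 1000, num=10)],
    'criterion': ['gini', 'entropy'],
    'max_depth': [int(x) for x in np.linspace(10, 220, num=8)],
    'min_samples_split': [2, 5, 10, 15, 20],
    'min_samples_leaf': [1, 2, 4, 8, 12],
    'max_features': [5, 10, 20, 'sqrt'],
    'min_impurity_decrease': [0, 0.1, 0.2, 0.5, 0.8],
    'bootstrap': [True, False],
}

# total number of distinct combinations in the full grid
n_combinations = prod(len(v) for v in param_grid.values())
print(n_combinations)  # 160000
```

With `n_iter = 2500`, the randomized search samples under 2% of these 160,000 combinations, which is why it finishes in a reasonable time.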
In [ ]:
#apply random search
rf_random = RandomForestClassifier(random_state = 42, warm_start = True, n_jobs = -1)
rf_random_search = RandomizedSearchCV(rf_random, param_grid, n_iter = 2500, cv = 5, verbose = 1, random_state=42, n_jobs = -1)
rf_random_search.fit(X_train, y_train)
Fitting 5 folds for each of 2500 candidates, totalling 12500 fits
Out[ ]:
RandomizedSearchCV(cv=5,
                   estimator=RandomForestClassifier(n_jobs=-1, random_state=42,
                                                    warm_start=True),
                   n_iter=2500, n_jobs=-1,
                   param_distributions={'bootstrap': [True, False],
                                        'criterion': ['gini', 'entropy'],
                                        'max_depth': [10, 40, 70, 100, 130, 160,
                                                      190, 220],
                                        'max_features': [5, 10, 20, 'sqrt'],
                                        'min_impurity_decrease': [0, 0.1, 0.2,
                                                                  0.5, 0.8],
                                        'min_samples_leaf': [1, 2, 4, 8, 12],
                                        'min_samples_split': [2, 5, 10, 15, 20],
                                        'n_estimators': [100, 200, 300, 400,
                                                         500, 600, 700, 800,
                                                         900, 1000]},
                   random_state=42, verbose=1)

Now we can see which parameters performed best and check their performance on the test set.

In [ ]:
#best parameters for random search
print("Best parameters: ", rf_random_search.best_params_)
rf_random_model = rf_random_search.best_estimator_
Best parameters:  {'n_estimators': 300, 'min_samples_split': 2, 'min_samples_leaf': 1, 'min_impurity_decrease': 0, 'max_features': 'sqrt', 'max_depth': 220, 'criterion': 'entropy', 'bootstrap': False}
In [ ]:
#check the performance of best model on test set
rf_random_model.fit(X_train, y_train)
y_pred = rf_random_model.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.81154
 - Precision: 0.81154
 - Recall: 0.81154
 - F1 score:  0.81154
 - AUC Score:  0.87418
 - Matthews_corrcoef: 0.749
              precision    recall  f1-score   support

           0       0.85      0.91      0.88        67
           1       0.79      0.85      0.82        62
           2       0.76      0.71      0.74        63
           3       0.84      0.76      0.80        68

    accuracy                           0.81       260
   macro avg       0.81      0.81      0.81       260
weighted avg       0.81      0.81      0.81       260


Confusion matrix:
Out[ ]:
accuracy                                              0.811538
precision                                             0.811538
recall                                                0.811538
f1                                                    0.811538
roc_auc                                               0.874178
matthews                                              0.749418
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object

Notice that even this simple search has improved the model: accuracy went from $0.773$ to $0.811$, the AUC score from $0.848$ to $0.874$, and the Matthews correlation coefficient from $0.699$ to $0.749$. Overall, all metrics improved.

Now let's look more closely at how the best models found by the random search behave and what their accuracy is.

In [ ]:
#giving the accuracy of the first 10 models
rs_results = pd.DataFrame(rf_random_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= rs_results.iloc[:10, :]

models = {}
for i in top_10_parameters.index:
    models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values

results = []
names = []
for model, score in models.items():
	results.append(score)
	names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Accuracy of top 10 models")
plt.ylabel("Accuracy")
plt.show()

Notice that the best model we found has both a high mean and a high median.

In [ ]:
top_10_parameters
Out[ ]:
mean_fit_time std_fit_time mean_score_time std_score_time param_n_estimators param_min_samples_split param_min_samples_leaf param_min_impurity_decrease param_max_features param_max_depth ... param_bootstrap params split0_test_score split1_test_score split2_test_score split3_test_score split4_test_score mean_test_score std_test_score rank_test_score
0 2.432691 0.676007 0.318548 0.284475 300 2 1 0 sqrt 220 ... False {'n_estimators': 300, 'min_samples_split': 2, ... 0.812500 0.725962 0.730769 0.826923 0.783654 0.775962 0.041313 1
1 13.739838 0.251095 0.810432 0.216933 900 5 2 0 sqrt 190 ... False {'n_estimators': 900, 'min_samples_split': 5, ... 0.807692 0.735577 0.730769 0.822115 0.778846 0.775000 0.036916 2
2 0.697334 0.270379 0.149600 0.099821 100 2 1 0 5 190 ... False {'n_estimators': 100, 'min_samples_split': 2, ... 0.812500 0.745192 0.745192 0.807692 0.764423 0.775000 0.029543 2
3 9.198190 0.221321 1.450120 0.195836 900 5 2 0 sqrt 40 ... False {'n_estimators': 900, 'min_samples_split': 5, ... 0.807692 0.735577 0.730769 0.822115 0.778846 0.775000 0.036916 2
4 1.671727 0.306434 0.339691 0.221298 200 2 2 0 sqrt 190 ... False {'n_estimators': 200, 'min_samples_split': 2, ... 0.812500 0.740385 0.725962 0.822115 0.769231 0.774038 0.038099 5
5 2.151842 0.427804 0.186900 0.081329 300 2 1 0 5 70 ... False {'n_estimators': 300, 'min_samples_split': 2, ... 0.798077 0.745192 0.745192 0.807692 0.774038 0.774038 0.025979 5
6 2.228039 0.353992 0.444810 0.239826 300 2 1 0 sqrt 220 ... False {'n_estimators': 300, 'min_samples_split': 2, ... 0.798077 0.745192 0.745192 0.807692 0.774038 0.774038 0.025979 5
7 7.619214 1.714232 1.678509 1.488194 600 2 2 0 sqrt 40 ... False {'n_estimators': 600, 'min_samples_split': 2, ... 0.812500 0.740385 0.735577 0.812500 0.769231 0.774038 0.033447 5
8 3.236939 0.560252 0.803450 0.320552 500 5 2 0 sqrt 130 ... False {'n_estimators': 500, 'min_samples_split': 5, ... 0.798077 0.725962 0.730769 0.822115 0.788462 0.773077 0.038148 9
9 5.749816 0.753724 1.266811 0.701573 900 5 2 0 sqrt 10 ... False {'n_estimators': 900, 'min_samples_split': 5, ... 0.802885 0.740385 0.730769 0.822115 0.769231 0.773077 0.035119 9

10 rows × 21 columns

Notice that the other top-performing models behave similarly and have similar parameters. For instance, all of the top-performing models have bootstrap = False. This gives us some insight into the hyperparameter space and where it is better to search.

We could also plot the average performance for each value of hyperparameter in our search.

In [ ]:
#modify the dataframe
rs_results = rs_results.drop([
            'mean_fit_time', 
            'std_fit_time', 
            'mean_score_time',
            'std_score_time', 
            'params', 
            'split0_test_score', 
            'split1_test_score', 
            'split2_test_score', 
            'split3_test_score', 
            'split4_test_score', 
            'std_test_score'],
            axis=1)
In [ ]:
#plotting average performance for each parameter
fig, axs = plt.subplots(ncols=4, nrows=2, figsize = (20,15))
sns.set(style="whitegrid", color_codes=True, font_scale = 1)

sns.barplot(x='param_n_estimators', y='mean_test_score', data=rs_results, ax=axs[0,0], color='lightgrey')
axs[0,0].set_title(label = 'n_estimators', size=20, weight='bold')

sns.barplot(x='param_min_samples_split', y='mean_test_score', data=rs_results, ax=axs[0,1], color='coral')
axs[0,1].set_title(label = 'min_samples_split', size=20, weight='bold')

sns.barplot(x='param_min_samples_leaf', y='mean_test_score', data=rs_results, ax=axs[0,2], color='lightgreen')
axs[0,2].set_title(label = 'min_samples_leaf', size=20, weight='bold')

sns.barplot(x='param_min_impurity_decrease', y='mean_test_score', data=rs_results, ax=axs[0,3], color='lightgrey')
axs[0,3].set_title(label = 'min_impurity_decrease', size=20, weight='bold')

sns.barplot(x='param_max_features', y='mean_test_score', data=rs_results, ax=axs[1,0], color='wheat')
axs[1,0].set_title(label = 'max_features', size=20, weight='bold')

sns.barplot(x='param_max_depth', y='mean_test_score', data=rs_results, ax=axs[1,1], color='lightpink')
axs[1,1].set_title(label = 'max_depth', size=20, weight='bold')

sns.barplot(x='param_bootstrap',y='mean_test_score', data=rs_results, ax=axs[1,2], color='skyblue')
axs[1,2].set_title(label = 'bootstrap', size=20, weight='bold')

sns.barplot(x='param_criterion', y='mean_test_score', data=rs_results, ax=axs[1,3], color='wheat')
axs[1,3].set_title(label = 'criterion', size=20, weight='bold')

plt.show()

Looking at the plots above, we can extract some insight about how well each value of each hyperparameter performed on average. In particular we see that:

  • n_estimators: performs best around 400 trees
  • min_samples_split: performs best around 10 splits
  • min_samples_leaf: there seems to be an increasing trend towards 12 samples; it also performs well with 1
  • min_impurity_decrease: there is a clear decreasing trend, and the best value is 0
  • max_features: there seems to be a slight increasing trend; it performs better with 20 features
  • max_depth: it seems to behave well around depths of 40, 100 and 180
  • bootstrap: performance is similar for True and False; however, since the best models all have bootstrap = False, we keep only False
  • criterion: it performs better with entropy

Based on these insights, we can move on to a finer hyperparameter tuning using Grid Search. We will also include the parameters of the best model found so far in the Grid Search.

In [ ]:
#definition of the parameters to try for Grid Search
from sklearn.model_selection import GridSearchCV

n_estimators = [300, 400, 450]
criterion = ['entropy']
max_depth = [40, 100, 180, 190, 220]
min_samples_split = [2, 8, 10, 12]
min_samples_leaf = [1, 2, 12, 15]
max_features = ['sqrt', 15, 20]
min_impurity_decrease = [0]
bootstrap = [False] 

param_grid = {
    'n_estimators': n_estimators,
    'criterion': criterion,
    'max_depth': max_depth,
    'min_samples_split': min_samples_split,
    'min_samples_leaf': min_samples_leaf,
    'max_features': max_features,
    'min_impurity_decrease': min_impurity_decrease,
    'bootstrap': bootstrap
}
In [ ]:
#performing grid search CV
rf_grid = RandomForestClassifier(random_state = 42, n_jobs = -1, warm_start = True)
rf_grid_search = GridSearchCV(rf_grid, param_grid, cv=5, scoring = 'accuracy', verbose = 1, n_jobs = -1, return_train_score=True)
rf_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 720 candidates, totalling 3600 fits
Out[ ]:
GridSearchCV(cv=5,
             estimator=RandomForestClassifier(n_jobs=-1, random_state=42,
                                              warm_start=True),
             n_jobs=-1,
             param_grid={'bootstrap': [False], 'criterion': ['entropy'],
                         'max_depth': [40, 100, 180, 190, 220],
                         'max_features': ['sqrt', 15, 20],
                         'min_impurity_decrease': [0],
                         'min_samples_leaf': [1, 2, 12, 15],
                         'min_samples_split': [2, 8, 10, 12],
                         'n_estimators': [300, 400, 450]},
             return_train_score=True, scoring='accuracy', verbose=1)

Once we have trained it, we can actually obtain the best parameters and check their performance:

In [ ]:
#obtaining the best parameters
print("Best parameters: ", rf_grid_search.best_params_)
rf_gs_best_model = rf_grid_search.best_estimator_
Best parameters:  {'bootstrap': False, 'criterion': 'entropy', 'max_depth': 40, 'max_features': 'sqrt', 'min_impurity_decrease': 0, 'min_samples_leaf': 2, 'min_samples_split': 2, 'n_estimators': 300}
In [ ]:
#performance of best model in test set
rf_gs_best_model.fit(X_train, y_train)
y_pred = rf_gs_best_model.predict(X_test)
rf_gs_best_model_metrics = performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.82692
 - Precision: 0.82692
 - Recall: 0.82692
 - F1 score:  0.82692
 - AUC Score:  0.88454
 - Matthews_corrcoef: 0.770
              precision    recall  f1-score   support

           0       0.85      0.91      0.88        67
           1       0.81      0.87      0.84        62
           2       0.81      0.75      0.78        63
           3       0.84      0.78      0.81        68

    accuracy                           0.83       260
   macro avg       0.83      0.83      0.83       260
weighted avg       0.83      0.83      0.83       260


Confusion matrix:

Notice that it performs slightly better than the model obtained by Random Search: accuracy increased by about $0.015$, and the other metrics improved as well. Moreover, comparing it with the previous classifiers, what seems to have improved is the ability to correctly classify classes 2 and 3, while for class 0 precision, recall and f1-score are unchanged.

Overall, this model performs well.

Before moving on, let's visualize how the accuracy changes in the best models.

In [ ]:
#giving the performance of the best 10 models found
rf_gs_results = pd.DataFrame(rf_grid_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= rf_gs_results.iloc[:10, :]

models = {}
for i in top_10_parameters.index:
    models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values

results = []
names = []
for model, score in models.items():
	results.append(score)
	names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Cross-validated accuracy of best models")
plt.show()
In [ ]:
top_10_parameters
Out[ ]:
mean_fit_time std_fit_time mean_score_time std_score_time param_bootstrap param_criterion param_max_depth param_max_features param_min_impurity_decrease param_min_samples_leaf ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 2.492730 0.370648 1.919265 0.143564 False entropy 180 sqrt 0 2 ... 0.778846 0.033722 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
1 2.452039 1.305935 1.910090 0.785296 False entropy 220 sqrt 0 2 ... 0.778846 0.033722 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
2 3.820179 1.272117 1.464482 1.116295 False entropy 190 sqrt 0 2 ... 0.778846 0.033722 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
3 3.531351 1.357857 1.573789 1.329590 False entropy 100 sqrt 0 2 ... 0.778846 0.033722 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
4 2.374845 0.972114 2.310219 0.874939 False entropy 40 sqrt 0 2 ... 0.778846 0.033722 1 1.0 1.0 1.0 1.0 1.0 1.0 0.0
5 5.184129 0.466838 0.473733 0.211612 False entropy 180 sqrt 0 1 ... 0.776923 0.033530 6 1.0 1.0 1.0 1.0 1.0 1.0 0.0
6 4.220508 0.932175 0.957039 0.629906 False entropy 40 sqrt 0 1 ... 0.776923 0.033530 6 1.0 1.0 1.0 1.0 1.0 1.0 0.0
7 5.226416 0.349563 0.397337 0.151333 False entropy 220 sqrt 0 1 ... 0.776923 0.033530 6 1.0 1.0 1.0 1.0 1.0 1.0 0.0
8 5.572690 0.456409 0.554517 0.443839 False entropy 100 sqrt 0 1 ... 0.776923 0.033530 6 1.0 1.0 1.0 1.0 1.0 1.0 0.0
9 5.836584 0.316536 0.344877 0.320955 False entropy 190 sqrt 0 1 ... 0.776923 0.033530 6 1.0 1.0 1.0 1.0 1.0 1.0 0.0

10 rows × 28 columns

Notice that the best-performing models have similar characteristics. Note also that the mean training score is 1.0 for all of them: deep, unpruned trees fit the training set perfectly, so model selection must rely on the cross-validated test scores.

We could also visualize the Grid Search process better. We do this by looking at how the score changes when we modify one hyperparameter at a time, keeping all the others fixed at the values of the best model found.

In [ ]:
#plotting the score as the parameter changes
def plot_search_results(grid):
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']

    masks=[]
    masks_names= list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_'+p_k].data==p_v))
    params=grid.param_grid

    fig, ax = plt.subplots(2, int(len(params)/2), figsize=(28,10))
    fig.suptitle('Score per parameter')
    fig.text(0.08, 0.5, 'CV Average Score', va='center', rotation='vertical')
    
    j = 0
    for i, p in enumerate(masks_names):
        # mask selecting the rows where all *other* parameters equal their best value
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        if(i >= int(len(params)/2)):
            i = i - int(len(params)/2)
            j = 1
        ax[j, i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[j, i].set_title(p)

    plt.rc('xtick', labelsize=5) 
    plt.rc('ytick', labelsize=5) 
    plt.legend(loc="best", fontsize=15)
    plt.show()

plot_search_results(rf_grid_search)

These visualizations give us a good idea of how the hyperparameters affect performance.

Best model found¶

Now, let's save this best model that we have obtained.

In [ ]:
#best classifier of random forest
rf_best_classifier = rf_gs_best_model

rf_params = rf_best_classifier.get_params()
for param, value in rf_params.items():
    print(param, ":", value)
bootstrap : False
ccp_alpha : 0.0
class_weight : None
criterion : entropy
max_depth : 40
max_features : sqrt
max_leaf_nodes : None
max_samples : None
min_impurity_decrease : 0
min_samples_leaf : 2
min_samples_split : 2
min_weight_fraction_leaf : 0.0
n_estimators : 300
n_jobs : -1
oob_score : False
random_state : 42
verbose : 0
warm_start : True

We notice an improvement with respect to our initial classifier, which is what we hoped to achieve through cross-validated hyperparameter tuning. Now let's save the results in a dataframe.

In [ ]:
#saving results of best model in dataframe
best_classifiers = pd.DataFrame(columns = ['Random Forest', 'Logistic Regression'], index = ['parameters', 'accuracy', 'precision', 'recall', 'f1', 'roc_auc', 'matthews', 'confusion'])
rf_gs_best_model_metrics['parameters'] = rf_params
best_classifiers['Random Forest'] = rf_gs_best_model_metrics

Now we could compute which features were the most relevant in our classification.

In [ ]:
# Create a series containing feature importances from the model and feature names from the training data
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(13, 6))
feature_importances.plot.bar(ax=ax);
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Most Important Features')
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20) 
plt.show()

We can also visualize the resulting forest. For instance, let's visualize the first tree in the random forest.

In [ ]:
#visualize one of the trees in the random forest
from sklearn.tree import plot_tree

fig = plt.figure(figsize=(15, 10))
plot_tree(rf_best_classifier.estimators_[0],                                    
          feature_names = X_train.columns.astype(str),
          class_names = rf_best_classifier.classes_.astype(str), 
          filled=True, rounded=True)

plt.show()

Logistic Regression¶

Now we want to apply Logistic Regression. Recall that logistic regression is a supervised learning algorithm for binary classification tasks. It is a generalized linear model that uses the logistic function to estimate the probability that an instance belongs to a given class.

Unlike Random Forest, this algorithm does not directly handle multiclass classification. To apply it here, we can therefore do one of the following:

  • OvA (One-versus-All): we train a separate binary classifier for each class, i.e. N binary classifiers in total. To make a prediction, every classifier is applied to the input instance, and the class with the highest predicted probability is selected.
  • Multinomial approach: here we do not treat each class separately. A single classifier is trained to directly predict the probability of each class, producing a probability distribution over all possible classes. This adapts logistic regression itself to the multiclass problem by estimating the class probabilities with a softmax function; the class with the highest probability is the prediction.
  • OvO (One-versus-One): we train a separate binary classifier for each pair of classes, i.e. $\frac{N(N-1)}{2}$ binary classifiers. To make a prediction, every classifier votes for one of its two classes, and the class with the most votes is selected.
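A minimal sketch of the softmax step used by the multinomial approach, on hypothetical per-class scores (not produced by the project's model):

```python
import numpy as np

def softmax(z):
    # subtract the max for numerical stability; the result sums to 1
    e = np.exp(z - z.max())
    return e / e.sum()

scores = np.array([2.0, 1.0, 0.1, -1.0])  # hypothetical linear scores for 4 classes
probs = softmax(scores)                   # probability distribution over classes
pred_class = int(np.argmax(probs))        # the predicted class is the argmax
```

For comparison, with N = 4 classes the OvO strategy would instead train 4·3/2 = 6 binary classifiers and combine them by voting.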

Now we want to check which of these approaches works best.

Applying the model¶

Let's start with OvR (One-versus-Rest, sklearn's name for the OvA strategy above) and see how it performs on the test set.

In [ ]:
#apply OvR classification
lr_classifier = LogisticRegression(multi_class='ovr', solver='liblinear')
lr_classifier.fit(X_train, y_train)
Out[ ]:
LogisticRegression(multi_class='ovr', solver='liblinear')
In [ ]:
#OvR classification performance
y_pred = lr_classifier.predict(X_test) 
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.76154
 - Precision: 0.76154
 - Recall: 0.76154
 - F1 score:  0.76154
 - AUC Score:  0.84025
 - Matthews_corrcoef: 0.682
              precision    recall  f1-score   support

           0       0.85      0.90      0.87        67
           1       0.77      0.77      0.77        62
           2       0.68      0.63      0.66        63
           3       0.74      0.74      0.74        68

    accuracy                           0.76       260
   macro avg       0.76      0.76      0.76       260
weighted avg       0.76      0.76      0.76       260


Confusion matrix:
Out[ ]:
accuracy                                              0.761538
precision                                             0.761538
recall                                                0.761538
f1                                                    0.761538
roc_auc                                               0.840254
matthews                                              0.682003
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object

In general, the initial accuracy here is lower than the initial one we obtained with Random Forest, which suggests that Logistic Regression performs worse on this dataset. Moreover, as before, the model classifies classes 0 and 1 fairly well, while it performs worse on classes 2 and 3 and misclassifies mainly between these two.

Now, we can also see how the classification performs using the multinomial approach.

In [ ]:
#apply multinomial classification and check the performance
lr_classifier = LogisticRegression(multi_class='multinomial',solver='newton-cg')
lr_classifier = lr_classifier.fit(X_train, y_train)
y_pred = lr_classifier.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.77308
 - Precision: 0.77308
 - Recall: 0.77308
 - F1 score:  0.77308
 - AUC Score:  0.84789
 - Matthews_corrcoef: 0.697
              precision    recall  f1-score   support

           0       0.84      0.87      0.85        67
           1       0.80      0.77      0.79        62
           2       0.71      0.67      0.69        63
           3       0.74      0.78      0.76        68

    accuracy                           0.77       260
   macro avg       0.77      0.77      0.77       260
weighted avg       0.77      0.77      0.77       260


Confusion matrix:
Out[ ]:
accuracy                                              0.773077
precision                                             0.773077
recall                                                0.773077
f1                                                    0.773077
roc_auc                                               0.847886
matthews                                              0.697377
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object

Here we get a better result than before, close to the one obtained with Random Forest before hyperparameter tuning. We will compare them in more depth later.

For the multinomial approach we can also obtain the probability estimates for each class, for each data point: the output is a vector of probabilities over the classes.

In [ ]:
#probabilites of belonging to a class for each sample
probabilities = lr_classifier.predict_proba(X_test)
print("Probabilities:", probabilities)
Probabilities: [[2.69426554e-05 1.68496453e-03 2.69354327e-01 7.28933766e-01]
 [6.71111374e-03 9.43731948e-01 2.37474413e-02 2.58094965e-02]
 [2.45995421e-04 2.42185519e-02 9.35936052e-01 3.95994007e-02]
 ...
 [4.74515363e-03 9.07826831e-01 1.43233600e-02 7.31046555e-02]
 [8.11863476e-03 2.65820090e-01 6.74353698e-01 5.17075771e-02]
 [6.76379977e-01 2.26857677e-01 3.59868197e-02 6.07755264e-02]]
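As a quick sanity check, each row of `predict_proba` is a distribution over the four classes, so it sums to one, and `predict` is simply the argmax of these probabilities. A sketch on a synthetic fit (not the project's data):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# synthetic 4-class stand-in for the project's data
X, y = make_classification(n_samples=200, n_classes=4, n_informative=8, random_state=42)
clf = LogisticRegression(solver='newton-cg').fit(X, y)

proba = clf.predict_proba(X)                       # shape (n_samples, n_classes)
assert np.allclose(proba.sum(axis=1), 1.0)         # each row is a distribution
assert (proba.argmax(axis=1) == clf.predict(X)).all()  # predict = argmax of proba
```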

We can also see how the classifier obtained using OvO performs.

In [ ]:
#apply OvO classification and check the performance
from sklearn.multiclass import OneVsOneClassifier

OvO_clf = OneVsOneClassifier(LogisticRegression())
OvO_clf.fit(X_train, y_train)
y_pred = OvO_clf.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.75769
 - Precision: 0.75769
 - Recall: 0.75769
 - F1 score:  0.75769
 - AUC Score:  0.83736
 - Matthews_corrcoef: 0.677
              precision    recall  f1-score   support

           0       0.86      0.88      0.87        67
           1       0.77      0.74      0.75        62
           2       0.69      0.63      0.66        63
           3       0.71      0.76      0.74        68

    accuracy                           0.76       260
   macro avg       0.76      0.76      0.76       260
weighted avg       0.76      0.76      0.76       260


Confusion matrix:
Out[ ]:
accuracy                                              0.757692
precision                                             0.757692
recall                                                0.757692
f1                                                    0.757692
roc_auc                                               0.837362
matthews                                              0.676922
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object

This is the worst performance we have seen so far.

In general, Logistic Regression seems to perform worse than Random Forest, possibly because the data does not appear to be linearly separable. However, before drawing this conclusion, let's try fine-tuning the model.

Hyperparameter selection¶

In this case we decided to use Grid Search CV directly, since fitting Logistic Regression is much quicker than fitting a Random Forest: there is no need for a preliminary Randomized Search, and we can afford the more exhaustive approach from the start.

In [ ]:
#definition of parameters to try for grid search
penalty = ['l2', 'l1', 'elasticnet', None]  # None disables regularization (the string 'None' is not a valid penalty)
tol = [1e-1, 1e-2, 1e-3, 1e-4, 1e-5]
C = [0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10]
solver = ['lbfgs', 'liblinear', 'newton-cholesky', 'saga']
max_iter = [int(x) for x in np.linspace(10, 1000, num=10)]
multi_class = ['ovr', 'multinomial']
l1_ratio = [x for x in np.linspace(0, 1, num=3)]

param_grid = {
    'penalty': penalty,
    'tol': tol,
    'C': C,
    'solver': solver,
    'max_iter': max_iter,
    'multi_class': multi_class,
    'l1_ratio': l1_ratio
}
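Note that not every penalty/solver pair in this grid is actually valid (for instance, `lbfgs` only supports `l2` or no penalty), so many of the candidate fits will fail and be scored as NaN. A minimal sketch of pruning the grid to valid pairs before the search; the compatibility table below is an assumption based on the scikit-learn documentation, so check it against your installed version:

```python
from itertools import product

# Solver -> penalties it supports (assumed from the scikit-learn docs)
VALID = {
    'lbfgs': {'l2', None},
    'liblinear': {'l1', 'l2'},
    'newton-cholesky': {'l2', None},
    'saga': {'l1', 'l2', 'elasticnet', None},
}

def valid_combos(penalties, solvers):
    """Keep only the (penalty, solver) pairs the solver can optimize."""
    return [(p, s) for p, s in product(penalties, solvers) if p in VALID[s]]

combos = valid_combos(['l2', 'l1', 'elasticnet', None],
                      ['lbfgs', 'liblinear', 'newton-cholesky', 'saga'])
print(len(combos))  # fewer than the full 4 x 4 = 16 pairs
```

An alternative is to pass GridSearchCV a list of per-solver parameter dictionaries, which avoids the invalid combinations entirely.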
In [ ]:
#apply grid search CV
lr_grid = LogisticRegression(random_state = 42)
lr_grid_search = GridSearchCV(lr_grid, param_grid, cv = 5, scoring = 'accuracy', return_train_score=True, verbose = 1)
lr_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 38400 candidates, totalling 192000 fits
Out[ ]:
GridSearchCV(cv=5, estimator=LogisticRegression(),
             param_grid={'C': [0.001, 0.01, 0.1, 0.5, 1, 2, 5, 10],
                         'l1_ratio': [0.0, 0.5, 1.0],
                         'max_iter': [10, 120, 230, 340, 450, 560, 670, 780,
                                      890, 1000],
                         'multi_class': ['ovr', 'multinomial'],
                         'penalty': ['l2', 'l1', 'elasticnet', 'None'],
                         'solver': ['lbfgs', 'liblinear', 'newton-cholesky',
                                    'saga'],
                         'tol': [0.1, 0.01, 0.001, 0.0001, 1e-05]},
             return_train_score=True, scoring='accuracy', verbose=1)

Once we have trained it, we can actually obtain the best parameters and check their performance:

In [ ]:
#obtain the best parameters
print("Best parameters: ", lr_grid_search.best_params_)
lr_gs_best_model = lr_grid_search.best_estimator_
Best parameters:  {'C': 0.1, 'l1_ratio': 0.0, 'max_iter': 120, 'multi_class': 'multinomial', 'penalty': 'l1', 'solver': 'saga', 'tol': 0.01}
In [ ]:
#check the performance of the model 
lr_gs_best_model.fit(X_train, y_train)
y_pred = lr_gs_best_model.predict(X_test)
lr_gs_best_model_metrics = performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.78077
 - Precision: 0.78077
 - Recall: 0.78077
 - F1 score:  0.78077
 - AUC Score:  0.85287
 - Matthews_corrcoef: 0.708
              precision    recall  f1-score   support

           0       0.86      0.93      0.89        67
           1       0.82      0.79      0.80        62
           2       0.68      0.63      0.66        63
           3       0.75      0.76      0.76        68

    accuracy                           0.78       260
   macro avg       0.78      0.78      0.78       260
weighted avg       0.78      0.78      0.78       260


Confusion matrix:

We see only a modest improvement with respect to the initial multinomial model, so it seems that this type of classifier cannot learn much more from the data. The resulting model is still worse than Random Forest on all metrics; for instance, accuracy goes from $0.827$ (Random Forest) down to $0.781$. Moreover, we see that classes 2 and 3 in particular are often confused with each other.

Now let's visualize how the accuracy changes in the best models.

In [ ]:
#visualization of average accuracy score of best models
lr_gs_results = pd.DataFrame(lr_grid_search.cv_results_).sort_values('rank_test_score').reset_index(drop=True)
top_10_parameters= lr_gs_results.iloc[:10, :]

models = {}
for i in top_10_parameters.index:
    models[str(top_10_parameters.loc[i, 'params'])] = top_10_parameters.loc[i, 'split0_test_score':'split4_test_score'].values

results = []
names = []
for model, score in models.items():
	results.append(score)
	names.append(model)
plt.boxplot(results, showmeans = True)
plt.title("Average accuracy score of the best models")
plt.show()

Many models seem to behave similarly on our dataset.

In [ ]:
top_10_parameters
Out[ ]:
mean_fit_time std_fit_time mean_score_time std_score_time param_C param_l1_ratio param_max_iter param_multi_class param_penalty param_solver ... mean_test_score std_test_score rank_test_score split0_train_score split1_train_score split2_train_score split3_train_score split4_train_score mean_train_score std_train_score
0 0.032924 0.003818 0.001782 0.000409 0.1 0.5 670 multinomial l1 saga ... 0.760577 0.028099 1 0.771635 0.778846 0.776442 0.771635 0.772837 0.774279 0.002885
1 0.040890 0.007960 0.002195 0.000399 0.1 0.0 120 multinomial l1 saga ... 0.760577 0.028099 1 0.771635 0.778846 0.777644 0.771635 0.772837 0.774519 0.003097
2 0.042094 0.007462 0.001790 0.000752 0.1 1.0 780 multinomial l1 saga ... 0.760577 0.028099 1 0.771635 0.778846 0.777644 0.771635 0.772837 0.774519 0.003097
3 0.035490 0.004170 0.002400 0.000497 0.1 0.0 340 multinomial l1 saga ... 0.760577 0.028099 1 0.772837 0.778846 0.776442 0.772837 0.772837 0.774760 0.002475
4 0.036688 0.000976 0.002001 0.000630 0.1 0.0 780 multinomial l1 saga ... 0.760577 0.028099 1 0.772837 0.780048 0.776442 0.771635 0.772837 0.774760 0.003097
5 0.035511 0.007932 0.001597 0.000790 0.1 1.0 1000 multinomial l1 saga ... 0.760577 0.028099 1 0.772837 0.778846 0.777644 0.771635 0.772837 0.774760 0.002905
6 0.039301 0.008711 0.001988 0.000631 0.1 1.0 450 multinomial elasticnet saga ... 0.760577 0.028099 1 0.772837 0.778846 0.776442 0.771635 0.772837 0.774519 0.002698
7 0.038313 0.003195 0.001787 0.000756 0.1 0.0 450 multinomial l1 saga ... 0.760577 0.028099 1 0.771635 0.778846 0.777644 0.771635 0.772837 0.774519 0.003097
8 0.038688 0.007938 0.001796 0.000990 0.1 0.0 560 multinomial l1 saga ... 0.760577 0.028099 1 0.772837 0.778846 0.776442 0.771635 0.772837 0.774519 0.002698
9 0.034514 0.009134 0.001796 0.000746 0.1 1.0 230 multinomial elasticnet saga ... 0.760577 0.028099 1 0.772837 0.778846 0.776442 0.771635 0.772837 0.774519 0.002698

10 rows × 27 columns

Notice that the parameters of the best performing models are similar.

We can also look at how each parameter behaves when all of the others are held fixed at their values in the best model we found.

In [ ]:
#plotting the score as the parameter changes
def plot_search_results(grid):
    """For each hyperparameter, plot the CV score as it varies while every
    other parameter is fixed at its value in the best model."""
    results = grid.cv_results_
    means_test = results['mean_test_score']
    stds_test = results['std_test_score']

    #for each parameter, mask the candidates that match its best value
    masks = []
    masks_names = list(grid.best_params_.keys())
    for p_k, p_v in grid.best_params_.items():
        masks.append(list(results['param_' + p_k].data == p_v))
    params = grid.param_grid

    fig, ax = plt.subplots(2, int(len(params)/2 + 1), figsize=(24, 8))
    fig.suptitle('Score per parameter')
    fig.text(0.08, 0.5, 'CV Average Score', va='center', rotation='vertical')

    j = 1
    for i, p in enumerate(masks_names):
        #select the candidates where every parameter except p is at its best value
        m = np.stack(masks[:i] + masks[i+1:])
        best_parms_mask = m.all(axis=0)
        best_index = np.where(best_parms_mask)[0]
        x = np.array(params[p])
        y_1 = np.array(means_test[best_index])
        e_1 = np.array(stds_test[best_index])
        if i >= int(len(params)/2):
            i = i - int(len(params)/2)
            j = 0
        ax[j, i].errorbar(x, y_1, e_1, linestyle='--', marker='o', label='test')
        ax[j, i].set_title(p)
        ax[j, i].legend(loc="best", fontsize=10)

    plt.rc('ytick', labelsize=10)
    plt.rc('xtick', labelsize=10)
    plt.show()

plot_search_results(lr_grid_search)

Let's apply the same procedure to the One-vs-One classifier as well, this time with a smaller grid to keep the search quick.

In [ ]:
#definition of parameters for grid search and grid search
penalty = ['l2', 'l1', 'elasticnet', None]  #use the None object, not the string 'None'
tol = [1e-2, 1e-3, 1e-4]
C = [0.01, 0.1, 0.5, 1, 2]
solver = ['lbfgs', 'liblinear', 'newton-cholesky', 'saga']
max_iter = [250, 300]
multi_class = ['ovr', 'multinomial']

param_grid = {
    'estimator__penalty': penalty,
    'estimator__tol': tol,
    'estimator__C': C,
    'estimator__solver': solver,
    'estimator__max_iter': max_iter,
    'estimator__multi_class': multi_class
}

ovo_grid = OneVsOneClassifier(LogisticRegression(random_state = 42))
ovo_grid_search = GridSearchCV(ovo_grid, param_grid, cv = 5, scoring = 'accuracy', return_train_score=True, verbose = 1)
ovo_grid_search.fit(X_train, y_train)
Fitting 5 folds for each of 960 candidates, totalling 4800 fits
Out[ ]:
GridSearchCV(cv=5,
             estimator=OneVsOneClassifier(estimator=LogisticRegression(random_state=42)),
             param_grid={'estimator__C': [0.01, 0.1, 0.5, 1, 2],
                         'estimator__max_iter': [250, 300],
                         'estimator__multi_class': ['ovr', 'multinomial'],
                         'estimator__penalty': ['l2', 'l1', 'elasticnet',
                                                'None'],
                         'estimator__solver': ['lbfgs', 'liblinear',
                                               'newton-cholesky', 'saga'],
                         'estimator__tol': [0.01, 0.001, 0.0001]},
             return_train_score=True, scoring='accuracy', verbose=1)
In [ ]:
#getting the best parameters
print("best parameters: ", ovo_grid_search.best_params_)
ovo_gs_best_model = ovo_grid_search.best_estimator_
best parameters:  {'estimator__C': 0.5, 'estimator__max_iter': 250, 'estimator__multi_class': 'multinomial', 'estimator__penalty': 'l1', 'estimator__solver': 'saga', 'estimator__tol': 0.0001}
In [ ]:
#checking the performance of the model
ovo_gs_best_model.fit(X_train, y_train)
y_pred = ovo_gs_best_model.predict(X_test)
performance_metric(y_test, y_pred)
-- PERFORMANCE OF THE MODEL -- 
General statistics:
 - Accuracy: 0.76538
 - Precision: 0.76538
 - Recall: 0.76538
 - F1 score:  0.76538
 - AUC Score:  0.84253
 - Matthews_corrcoef: 0.687
              precision    recall  f1-score   support

           0       0.86      0.90      0.88        67
           1       0.78      0.76      0.77        62
           2       0.69      0.63      0.66        63
           3       0.72      0.76      0.74        68

    accuracy                           0.77       260
   macro avg       0.76      0.76      0.76       260
weighted avg       0.76      0.77      0.76       260


Confusion matrix:
Out[ ]:
accuracy                                              0.765385
precision                                             0.765385
recall                                                0.765385
f1                                                    0.765385
roc_auc                                               0.842526
matthews                                              0.687176
confusion    <sklearn.metrics._plot.confusion_matrix.Confus...
dtype: object

Here, we see that this result is worse than the multinomial approach (accuracy drops from $0.781$ to $0.765$). Therefore, out of all three approaches, the multinomial one performs best, but it still does not perform as well as Random Forest.

Best model found¶

Now let's save the best model that we found.

In [ ]:
#best classifier of logistic regression
lr_best_classifier = lr_gs_best_model
lr_params = lr_best_classifier.get_params()
for param, value in lr_params.items():
    print(param, ":", value)
C : 0.1
class_weight : None
dual : False
fit_intercept : True
intercept_scaling : 1
l1_ratio : 0.0
max_iter : 120
multi_class : multinomial
n_jobs : None
penalty : l1
random_state : None
solver : saga
tol : 0.01
verbose : 0
warm_start : False
In [ ]:
#saving results of best model in dataframe
lr_gs_best_model_metrics['parameters'] = lr_params
best_classifiers['Logistic Regression'] = lr_gs_best_model_metrics

We could also try to visualize the most important features in this case. To do so, since we have a separate set of coefficients for each class, we take the average of the absolute values of the coefficients across all classes.

In [ ]:
# Create a series containing feature importances from the model and feature names from the training data
coefficients = lr_best_classifier.coef_
avg_importance = np.mean(np.abs(coefficients), axis=0)

feature_importance2 = pd.Series(avg_importance, index=X_train.columns).sort_values(ascending=False)
fig, ax = plt.subplots(figsize=(13, 6))
feature_importance2.plot.bar(ax=ax);
plt.xlabel('Features')
plt.ylabel('Importance')
plt.title('Most Important Features')
plt.rc('xtick', labelsize=20) 
plt.rc('ytick', labelsize=20) 
plt.show()

Additional¶

In this part, we want to improve the accuracy of our classifiers. Since we have already done hyperparameter selection, we will try other methods to improve the performance.

The first thing we want to focus on is decreasing the number of features. Extra features can hurt performance because they may "confuse" the model with irrelevant data that prevents it from learning the actual relationships. Reducing them lowers the dimensionality of the data, helps avoid overfitting, and improves the interpretability of our models. A random forest performs implicit feature selection, since it splits nodes on the most important variables, but other machine learning models do not. While we only have around 30 features here, which is not many, we would still like to see whether some of them are unimportant for the classification.

We are going to use various methods in order to deal with this, and check whether this will bring some improvements in our classifier.

First of all, let's write a piece of code that will allow us to compare the random forest and logistic regression efficiently.

In [ ]:
def performance_metric2(y_t, y_p1, y_p2):
  print("-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- ")
  cm1 = confusion_matrix(y_t, y_p1)
  disp1 = ConfusionMatrixDisplay(confusion_matrix = cm1);

  cm2 = confusion_matrix(y_t, y_p2)
  disp2 = ConfusionMatrixDisplay(confusion_matrix = cm2);

  accuracy1 = accuracy_score(y_t, y_p1)
  precision1 = precision_score(y_t, y_p1, average = 'micro')
  recall1 = recall_score(y_t, y_p1, average = 'micro')
  f11 = f1_score(y_t, y_p1, average = 'micro')

  accuracy2 = accuracy_score(y_t, y_p2)
  precision2 = precision_score(y_t, y_p2, average = 'micro')
  recall2 = recall_score(y_t, y_p2, average = 'micro')
  f12 = f1_score(y_t, y_p2, average = 'micro')
  
  print("General statistics:")
  print("   RANDOM FOREST:                LOGISTIC REGRESSION:")
  print(" - Accuracy:", accuracy1.round(5), "                ", accuracy2.round(5))
  print(" - Precision:", precision1.round(5), "               ", precision2.round(5))
  print(" - Recall:", recall1.round(5), "                  ", recall2.round(5))
  print(" - F1 score:", f11.round(5), "                ", f12.round(5))

  roc_auc_multiclass1 = roc_auc_score_multiclass(y_t, y_p1)
  roc_auc_multiclass2 = roc_auc_score_multiclass(y_t, y_p2)

  temporary1 = []
  for i in roc_auc_multiclass1.values():
    temporary1.append(i)
  roc_auc_score1 = sum(temporary1) / len(temporary1)
  
  temporary2 = []
  for i in roc_auc_multiclass2.values():
    temporary2.append(i)
  roc_auc_score2 = sum(temporary2) / len(temporary2)

  print(" - AUC Score:", roc_auc_score1.round(5), "               ", roc_auc_score2.round(5))

  mc1 = matthews_corrcoef(y_t, y_p1)
  mc2 = matthews_corrcoef(y_t, y_p2)
  print(' - Matthews corrcoef:', mc1.round(5), "       ", mc2.round(5))

  print("\nConfusion matrix:")
  fig, ax = plt.subplots(1,2, figsize=(24,8))
  fig.suptitle('Confusion Matrix')
  disp1.plot(ax = ax[0])
  ax[0].set_title("Random Forest")
  disp2.plot(ax = ax[1])
  ax[1].set_title("Logistic Regression")
  ax[0].grid(False)
  ax[1].grid(False)
  plt.show()
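As a side note on the metrics printed by these helpers: with micro averaging on a single-label multiclass problem, precision, recall and F1 all collapse to plain accuracy, because every false positive for one class is simultaneously a false negative for another. That is why the four numbers reported above are always identical. A quick check on toy labels (illustrative data, not our dataset):

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

y_true = [0, 1, 2, 2, 3, 1, 0]
y_pred = [0, 1, 2, 3, 3, 0, 0]  # 5 of 7 predictions are correct

acc = accuracy_score(y_true, y_pred)
# micro averaging pools TP/FP/FN over all classes before scoring
p = precision_score(y_true, y_pred, average='micro')
r = recall_score(y_true, y_pred, average='micro')
f = f1_score(y_true, y_pred, average='micro')
print(acc, p, r, f)  # all four coincide (up to rounding)
```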
Embedded Methods¶

Embedded Methods are methods for features selection that are incorporated within the model training process itself. These methods aim to select the most relevant features by considering their impact on the model's performance during training. So, instead of performing feature selection as a separate preprocessing step, embedded methods evaluate the importance or relevance of features directly within the model's optimization process.

Here we are going to use the implicit selection performed by Random Forest: we take the most relevant features identified by our best-performing random forest and use increasingly large subsets of them to train both Logistic Regression and Random Forest. For each subset, we check its cross-validated score and see which subset size performs best.
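scikit-learn also packages this idea in `SelectFromModel`, which keeps the features whose importance exceeds a threshold. A hedged sketch on synthetic data (the dataset and the median threshold are illustrative choices, not the ones used in this project):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

# illustrative synthetic data, not the project's dataset
X, y = make_classification(n_samples=300, n_features=20,
                           n_informative=5, random_state=42)

# fit a forest inside the selector and keep the features whose
# importance is at or above the median importance
selector = SelectFromModel(
    RandomForestClassifier(n_estimators=100, random_state=42),
    threshold='median')
X_reduced = selector.fit_transform(X, y)

print(X.shape, '->', X_reduced.shape)  # roughly half of the features survive
```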

In [ ]:
#plotting the performance varying the number of feature to consider
def plot_performance(estimator1, estimator2, feat_importance):
    results_rf = []
    results_lr = []
    for i in range(1, len(X_train.columns)):
        X_train_modified = X_train[feat_importance.nlargest(i).index].copy()
        results_rf.append(cross_val_score(estimator1, X_train_modified, y_train, scoring = 'accuracy', cv = 5))
        results_lr.append(cross_val_score(estimator2, X_train_modified, y_train, scoring = 'accuracy', cv = 5))

    fig, ax = plt.subplots(1,2, figsize=(24,8))
    fig.suptitle('Score per number of features')
    
    for i in range(0, len(X_train.columns)-1):
        x = np.array(i+1)
        y_1 = np.array(np.mean(results_rf[i]))
        e_1 = np.array(np.std(results_rf[i]))
        y_2 = np.array(np.mean(results_lr[i]))
        e_2 = np.array(np.std(results_lr[i]))
        ax[0].errorbar(x, y_1, e_1, linestyle='--', marker='o')
        ax[0].set_title("Random Forest")
        ax[1].errorbar(x, y_2, e_2, linestyle='--', marker='o')
        ax[1].set_title("Logistic Regression")
    plt.show()
In [ ]:
#plotting how the models behave using the most important features of the random forest
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
plot_performance(rf_best_classifier, lr_best_classifier, feature_importances)

In both cases, the best performance seems to happen when we have around 18 features.

In [ ]:
feature_importances = pd.Series(rf_best_classifier.feature_importances_, index=X_train.columns).sort_values(ascending=False)
feature_importances.nlargest(18).index
Out[ ]:
Index(['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3',
       'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16',
       'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11',
       'feature_8'],
      dtype='object')
In [ ]:
#performance of random forest and logistic regression
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)

X_train_mod = X_train[feature_importances.nlargest(18).index].copy()
X_test_mod = X_test[feature_importances.nlargest(18).index].copy()
rf_best_classifier.fit(X_train_mod, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_mod)

lr_best_classifier.fit(X_train_mod, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_mod)
performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- 
General statistics:
   RANDOM FOREST:                LOGISTIC REGRESSION:
 - Accuracy: 0.83846                  0.79615
 - Precision: 0.83846                 0.79615
 - Recall: 0.83846                    0.79615
 - F1 score: 0.83846                  0.79615
 - AUC Score: 0.89203                 0.86342
 - Matthews corrcoef: 0.78549         0.72825

Confusion matrix:

Notice that we get an increase in performance in both cases. In particular, our best random forest model went from an accuracy of $0.827$ to $0.838$, while the best logistic regression model went from $0.781$ to $0.796$. The other metrics have also improved overall. Still, random forest continues to outperform logistic regression.

Filter methods¶

Filter methods are feature selection techniques that select features based on their intrinsic properties. In particular, these methods rely on statistical measures or scoring criteria to evaluate the relevance or importance of features, and then select a subset of features based on these scores. There are many measures that we could use to understand which features to keep. In particular, we could use:

  • Information Gain
  • Chi-Square test
  • Fisher's Score
  • Correlation Coefficient
  • Variance Threshold

Let's try to see how our model behaves using information gain.
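Several of these criteria are also available off the shelf through `SelectKBest`, which scores each feature with a chosen criterion and keeps the `k` highest-scoring ones. A sketch using mutual information on synthetic data (the data and `k` are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, mutual_info_classif

# illustrative synthetic data; k is an arbitrary choice here
X, y = make_classification(n_samples=300, n_features=10,
                           n_informative=4, random_state=0)

# score every feature by its mutual information with the label,
# then keep only the k highest-scoring ones
skb = SelectKBest(score_func=mutual_info_classif, k=5).fit(X, y)
X_top = skb.transform(X)

print(X_top.shape)  # (300, 5)
```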

In [ ]:
from sklearn.feature_selection import mutual_info_classif

importances = mutual_info_classif(X_train, y_train)
feat_importances = pd.Series(importances, X_train.columns).sort_values(ascending = False)
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)

plot_performance(rf_best_classifier, lr_best_classifier, feat_importances)

Notice that in this case the best-performing models are the ones using all of the features, so removing features based on information gain is not really useful here.

Wrapper Methods¶

Wrapper methods are feature selection techniques that select subsets of features by evaluating them through the actual model performance. They search the space of all possible subsets of features, assessing their quality by learning and evaluating a classifier with that feature subset. It follows a greedy search approach. We could do:

  • Forward Feature Selection
  • Backward Feature Elimination
  • Recursive Feature Elimination

Let's start with the Forward Feature Selection. It is an iterative approach that starts with an empty set of features and greedily adds the most important feature at each step based on the model's performance. We are going to perform it for Random Forest and Logistic Regression.

In [ ]:
#train the selector of features
from sklearn.feature_selection import SequentialFeatureSelector

#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector  = SequentialFeatureSelector(rf_best_classifier, direction = 'forward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)

#get the features
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)

#check the performance 
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)

#LOGISTIC REGRESSION
selector  = SequentialFeatureSelector(lr_best_classifier, direction = 'forward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)

#get the features
mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)

#check the performance 
X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)

performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest:  Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
       'feature_11', 'feature_12', 'feature_15', 'feature_18', 'feature_19',
       'feature_24', 'feature_25', 'feature_30', 'A', 'B', 'C'],
      dtype='object')
Best features of logistic regression:  Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
       'feature_11', 'feature_12', 'feature_14', 'feature_15', 'feature_18',
       'feature_21', 'feature_24', 'feature_25', 'A', 'B', 'C'],
      dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- 
General statistics:
   RANDOM FOREST:                LOGISTIC REGRESSION:
 - Accuracy: 0.81923                  0.77692
 - Precision: 0.81923                 0.77692
 - Recall: 0.81923                    0.77692
 - F1 score: 0.81923                  0.77692
 - AUC Score: 0.87947                 0.85105
 - Matthews corrcoef: 0.75979         0.70276

Confusion matrix:

In this case, we get a decrease in the overall performance of both models, mainly due to misclassification between classes 2 and 3. This may have happened because we removed too many informative features from the models.

Let's also analyse the Backward Elimination method. It is similar to the previous one but works in the opposite direction: it starts with all the features and iteratively removes the least important one based on the model's performance. Let's repeat the same analysis as before.

In [ ]:
#check the performance of Backward Elimination Method
from sklearn.feature_selection import SequentialFeatureSelector

#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector  = SequentialFeatureSelector(rf_best_classifier, direction = 'backward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)

mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)

X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)

#LOGISTIC REGRESSION
selector  = SequentialFeatureSelector(lr_best_classifier, direction = 'backward', scoring = 'accuracy', cv = 5, n_jobs = -1)
selector = selector.fit(X_train, y_train)

mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)

X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)

performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest:  Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
       'feature_11', 'feature_15', 'feature_16', 'feature_18', 'feature_21',
       'feature_24', 'feature_25', 'feature_29', 'feature_30', 'A', 'B'],
      dtype='object')
Best features of logistic regression:  Index(['feature_3', 'feature_4', 'feature_5', 'feature_7', 'feature_11',
       'feature_12', 'feature_15', 'feature_16', 'feature_18', 'feature_21',
       'feature_23', 'feature_24', 'feature_25', 'A', 'B', 'C'],
      dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- 
General statistics:
   RANDOM FOREST:                LOGISTIC REGRESSION:
 - Accuracy: 0.82308                  0.77308
 - Precision: 0.82308                 0.77308
 - Recall: 0.82308                    0.77308
 - F1 score: 0.82308                  0.77308
 - AUC Score: 0.88172                 0.84831
 - Matthews corrcoef: 0.76448         0.69761

Confusion matrix:

In this case we again get slightly worse performance in both models, although the drop is smaller than before.

We could also try Recursive Feature Elimination. RFE likewise starts with all the features and iteratively removes the least important one(s), this time ranked by the model's coefficients or feature importances; it is similar in spirit to Backward Elimination.

In [ ]:
from sklearn.feature_selection import RFECV

#RANDOM FOREST
rf_best_classifier = RandomForestClassifier(bootstrap = False, ccp_alpha = 0.0, class_weight = None, criterion = 'entropy', max_depth = 40, max_features = 'sqrt', max_leaf_nodes = None, max_samples = None, min_impurity_decrease = 0, min_samples_leaf = 2, min_samples_split = 2, min_weight_fraction_leaf = 0.0, n_estimators = 300, n_jobs = -1, oob_score = False, random_state = 42, warm_start = True)
selector  = RFECV(rf_best_classifier, cv = 5, scoring = 'accuracy', n_jobs = -1)
selector = selector.fit(X_train, y_train)

mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of random forest: ", best_features)

X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
rf_best_classifier.fit(X_train_modified, y_train)
y_pred_mod1 = rf_best_classifier.predict(X_test_modified)

#LOGISTIC REGRESSION
selector  = RFECV(lr_best_classifier, cv = 5, scoring = 'accuracy', n_jobs = -1)
selector = selector.fit(X_train, y_train)

mask = selector.get_support()
best_features = X_train.columns[mask]
print("Best features of logistic regression: ", best_features)

X_train_modified = X_train[best_features].copy()
X_test_modified = X_test[best_features].copy()
lr_best_classifier.fit(X_train_modified, y_train)
y_pred_mod2 = lr_best_classifier.predict(X_test_modified)

performance_metric2(y_test, y_pred_mod1, y_pred_mod2)
Best features of random forest:  Index(['feature_2', 'feature_3', 'feature_4', 'feature_5', 'feature_7',
       'feature_10', 'feature_11', 'feature_12', 'feature_15', 'feature_16',
       'feature_18', 'feature_21', 'feature_24', 'feature_25', 'feature_30',
       'A', 'B', 'C'],
      dtype='object')
Best features of logistic regression:  Index(['feature_1', 'feature_2', 'feature_3', 'feature_4', 'feature_5',
       'feature_7', 'feature_9', 'feature_11', 'feature_15', 'feature_16',
       'feature_18', 'feature_21', 'feature_23', 'feature_24', 'feature_25',
       'feature_26', 'feature_30', 'A', 'B', 'C'],
      dtype='object')
-- PERFORMANCE OF THE RANDOM FOREST AND LOGISTIC REGRESSION-- 
General statistics:
   RANDOM FOREST:                LOGISTIC REGRESSION:
 - Accuracy: 0.82692                  0.78846
 - Precision: 0.82692                 0.78846
 - Recall: 0.82692                    0.78846
 - F1 score: 0.82692                  0.78846
 - AUC Score: 0.88421                 0.8583
 - Matthews corrcoef: 0.76987         0.71787

Confusion matrix:

In this case, we get no improvement for Random Forest, while Logistic Regression improves by $0.08$.

PCA¶

Now we want to reduce the dimensionality of the data with PCA and check how the models behave.

We will apply PCA for both models, varying the number of principal components at each step, and then see how the models perform.

In [ ]:
#visualize the performance of the models with respect to the number of principal components
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
y_train = np.array(y_train)
rf_best_classifier = RandomForestClassifier(
    bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='entropy',
    max_depth=40, max_features='sqrt', max_leaf_nodes=None, max_samples=None,
    min_impurity_decrease=0.0, min_samples_leaf=2, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1,
    oob_score=False, random_state=42, warm_start=True,
)

def plot_performance2(estimator1, estimator2):
    results_rf = []
    results_lr = []
    for i in range(1, len(X_train.columns)):
        pca = PCA(n_components = i)
        pca.fit(X_train_scaled)
        X_train_scaled_pca = pca.transform(X_train_scaled)
        results_rf.append(cross_val_score(estimator1, X_train_scaled_pca, y_train, scoring = 'accuracy', cv = 5))
        results_lr.append(cross_val_score(estimator2, X_train_scaled_pca, y_train, scoring = 'accuracy', cv = 5))

    fig, ax = plt.subplots(1, 2, figsize=(24,8))
    fig.suptitle('Score per number of features')

    for i in range(0, len(X_train.columns)-1):
        x = np.array(i + 1)
        y_1 = np.array(np.mean(results_rf[i]))
        e_1 = np.array(np.std(results_rf[i]))
        y_2 = np.array(np.mean(results_lr[i]))
        e_2 = np.array(np.std(results_lr[i]))
        ax[0].errorbar(x, y_1, e_1, linestyle='--', marker='o')
        ax[0].set_title("Random Forest")
        ax[1].errorbar(x, y_2, e_2, linestyle='--', marker='o')
        ax[1].set_title("Logistic Regression")

    plt.show()

plot_performance2(rf_best_classifier, lr_best_classifier)

Notice that the best performance is obtained only when all principal components are kept, so we cannot reduce the dataframe in this way. This happens because the variance of the data is spread across many principal components, so dropping any of them discards useful information.
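The claim that the variance is spread across many components can be checked directly from `pca.explained_variance_ratio_`; a sketch on standardized synthetic data (the dataset here stands in for our scaled training set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, _ = make_classification(n_samples=300, n_features=20, n_informative=15,
                           random_state=0)
X_scaled = StandardScaler().fit_transform(X)

# fit PCA with all components and accumulate the explained variance
pca = PCA().fit(X_scaled)
cumvar = np.cumsum(pca.explained_variance_ratio_)

# number of components needed to reach 95% of the total variance
n_95 = np.argmax(cumvar >= 0.95) + 1
print(f"{n_95} of {X.shape[1]} components explain 95% of the variance")
```

When `n_95` is close to the total number of features, as in our case, PCA cannot compress the data without losing information relevant to the classifiers.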

Conclusion¶

After performing this analysis, we clearly see that the model that performs better overall is Random Forest. This agrees with our expectation, since we saw from PCA that the data was not linearly separable in 2D or in 3D. Moreover, we have seen that the classifier mainly struggles to distinguish between classes 2 and 3.

In particular, the best-performing model is the Random Forest with parameters:

bootstrap : False

ccp_alpha : 0.0

class_weight : None

criterion : entropy

max_depth : 40

max_features : sqrt

max_leaf_nodes : None

max_samples : None

min_impurity_decrease : 0

min_samples_leaf : 2

min_samples_split : 2

min_weight_fraction_leaf : 0.0

n_estimators : 300

n_jobs : -1

oob_score : False

random_state : 42

verbose : 0

warm_start : True

and with reduced features:

['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3', 'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16', 'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11', 'feature_8']

Test¶

Now we want to import the file containing the test features on which we will output the predictions of our best model.

In [ ]:
test_df = pd.read_csv("./data/mldata_0003164501.TEST_FEATURES.csv", sep = ",")
test_df
Out[ ]:
id feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 ... feature_22 feature_23 feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 categorical_feature_1
0 0 0.216251 1.600011 3.124432 1.007526 1.248892 0.048878 -1.063602 0.256748 -0.659555 ... 0.962239 -0.779614 -1.260093 0.092691 2.274102 1.574947 1.261300 -0.264117 4.560051 B
1 1 -1.084906 -1.087918 -7.716479 -1.935018 3.081061 0.365231 2.270539 -0.493430 1.310316 ... 0.642725 0.948476 0.156880 -0.949245 -0.200635 -1.980347 0.855991 -0.438245 2.587744 C
2 2 1.106605 -1.545707 -2.163345 -1.813406 0.203347 1.633511 0.791142 1.033637 -0.030198 ... -0.486569 -0.341142 -1.352579 1.379190 1.703867 -0.060112 -0.283798 0.661002 1.840297 C
3 3 -2.042224 0.823476 -2.285369 -2.270773 -0.668559 -1.578110 -0.257186 -0.827421 -1.256929 ... -0.960816 -1.070933 2.171056 0.204803 0.620946 -2.036895 -1.177620 -0.785916 0.173098 C
4 4 -0.581274 0.363198 -3.570156 -4.286992 5.044998 -0.397267 4.118269 0.691859 0.523798 ... 1.105811 -0.490820 0.952442 -3.772214 -1.089031 1.407776 -0.136638 0.089052 -0.307349 C
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1295 1295 1.831470 0.953282 -1.071169 -2.489733 1.812463 -0.374221 -3.150657 -2.100736 -0.854273 ... -1.182196 -0.744190 1.014653 -2.621318 0.962818 0.312637 1.541494 0.898536 -5.367114 B
1296 1296 -0.063951 -0.471322 3.985702 -2.218718 0.870567 -0.764831 0.873860 0.588270 -2.135085 ... 1.069542 1.230174 -0.155100 -2.571712 0.679225 -0.681569 -1.413776 0.201078 -3.400294 A
1297 1297 0.654051 -1.157193 -1.193945 -4.021244 -2.909455 0.086752 2.514698 -1.439090 -0.369860 ... 1.346860 1.038488 -0.288692 4.636710 0.463763 0.847172 -0.841730 1.457939 5.344950 C
1298 1298 0.759557 -4.173832 -0.045697 5.252229 -0.855698 -0.798322 2.746694 -1.207565 2.051560 ... 0.223734 -0.540067 2.701097 -2.223971 -2.755301 -1.551487 -0.050303 0.515680 -5.663085 A
1299 1299 0.694421 2.003003 2.290706 -0.327494 -0.445400 0.827480 0.778321 0.237986 0.690612 ... -0.080420 0.679049 -0.970881 1.244630 -0.486578 1.107635 -0.723676 1.403534 -0.416362 C

1300 rows × 32 columns

Let's apply the transformations needed before running the classifier.

In [ ]:
test_df.set_index('id', drop = True, inplace = True)

# note: refitting the encoder on the test set assumes it contains exactly the
# same categories (A, B, C) as the training data; reusing the encoder fitted
# on the training set would be safer in general
encoder = OneHotEncoder(handle_unknown='ignore')
encoder_df = pd.DataFrame(encoder.fit_transform(test_df[['categorical_feature_1']]).toarray())
encoder_df.columns = ['A', 'B', 'C']

test_df = test_df.join(encoder_df)
test_df.drop('categorical_feature_1', axis = 1, inplace = True)
test_df
Out[ ]:
feature_1 feature_2 feature_3 feature_4 feature_5 feature_6 feature_7 feature_8 feature_9 feature_10 ... feature_24 feature_25 feature_26 feature_27 feature_28 feature_29 feature_30 A B C
id
0 0.216251 1.600011 3.124432 1.007526 1.248892 0.048878 -1.063602 0.256748 -0.659555 -0.733519 ... -1.260093 0.092691 2.274102 1.574947 1.261300 -0.264117 4.560051 0.0 1.0 0.0
1 -1.084906 -1.087918 -7.716479 -1.935018 3.081061 0.365231 2.270539 -0.493430 1.310316 -0.757843 ... 0.156880 -0.949245 -0.200635 -1.980347 0.855991 -0.438245 2.587744 0.0 0.0 1.0
2 1.106605 -1.545707 -2.163345 -1.813406 0.203347 1.633511 0.791142 1.033637 -0.030198 0.621888 ... -1.352579 1.379190 1.703867 -0.060112 -0.283798 0.661002 1.840297 0.0 0.0 1.0
3 -2.042224 0.823476 -2.285369 -2.270773 -0.668559 -1.578110 -0.257186 -0.827421 -1.256929 1.121218 ... 2.171056 0.204803 0.620946 -2.036895 -1.177620 -0.785916 0.173098 0.0 0.0 1.0
4 -0.581274 0.363198 -3.570156 -4.286992 5.044998 -0.397267 4.118269 0.691859 0.523798 -0.432748 ... 0.952442 -3.772214 -1.089031 1.407776 -0.136638 0.089052 -0.307349 0.0 0.0 1.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
1295 1.831470 0.953282 -1.071169 -2.489733 1.812463 -0.374221 -3.150657 -2.100736 -0.854273 -0.299772 ... 1.014653 -2.621318 0.962818 0.312637 1.541494 0.898536 -5.367114 0.0 1.0 0.0
1296 -0.063951 -0.471322 3.985702 -2.218718 0.870567 -0.764831 0.873860 0.588270 -2.135085 -1.167302 ... -0.155100 -2.571712 0.679225 -0.681569 -1.413776 0.201078 -3.400294 1.0 0.0 0.0
1297 0.654051 -1.157193 -1.193945 -4.021244 -2.909455 0.086752 2.514698 -1.439090 -0.369860 -0.814236 ... -0.288692 4.636710 0.463763 0.847172 -0.841730 1.457939 5.344950 0.0 0.0 1.0
1298 0.759557 -4.173832 -0.045697 5.252229 -0.855698 -0.798322 2.746694 -1.207565 2.051560 0.536480 ... 2.701097 -2.223971 -2.755301 -1.551487 -0.050303 0.515680 -5.663085 1.0 0.0 0.0
1299 0.694421 2.003003 2.290706 -0.327494 -0.445400 0.827480 0.778321 0.237986 0.690612 -0.251945 ... -0.970881 1.244630 -0.486578 1.107635 -0.723676 1.403534 -0.416362 0.0 0.0 1.0

1300 rows × 33 columns

Let's apply the classifier. Even though we performed hyperparameter selection only on X_train, we will now train our model on the whole dataset, in order to obtain a model that hopefully generalizes better to the unseen data. We no longer have a clean estimate of the generalization error, but we expect that adding X_test to the training set does not change the best hyperparameters: the data should come from the same distribution, and we have seen that the best-performing hyperparameters were fairly stable across the best models. Therefore, what was optimal before should still be optimal now. As a further robustness check, we can compare the cross-validation score on the whole dataframe with the one obtained before.

In [ ]:
most_important_features = ['C', 'B', 'A', 'feature_24', 'feature_15', 'feature_7', 'feature_3',
       'feature_4', 'feature_5', 'feature_25', 'feature_21', 'feature_16',
       'feature_18', 'feature_30', 'feature_12', 'feature_2', 'feature_11',
       'feature_8']
final_classifier = RandomForestClassifier(
    bootstrap=False, ccp_alpha=0.0, class_weight=None, criterion='entropy',
    max_depth=40, max_features='sqrt', max_leaf_nodes=None, max_samples=None,
    min_impurity_decrease=0.0, min_samples_leaf=2, min_samples_split=2,
    min_weight_fraction_leaf=0.0, n_estimators=300, n_jobs=-1,
    oob_score=False, random_state=42, warm_start=True,
)
In [ ]:
print("cross validation score on X_train: ", cross_val_score(final_classifier, X_train, y_train, scoring = 'accuracy', cv = 5, n_jobs = -1).mean())
print("cross validation score on the whole dataframe: ", cross_val_score(final_classifier, df, df_label, scoring = 'accuracy', cv = 5, n_jobs = -1).mean())
cross validation score on X_train:  0.7788461538461539
cross validation score on the whole dataframe:  0.7753846153846153

Notice that the cross validation score is very close to the one obtained before, so we expect a similar generalization performance.
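To make "similar" quantitative, one can compare the fold-wise scores rather than only the means; a sketch on synthetic data (the estimator and dataset here stand in for `final_classifier` and the project dataframe):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=42)

# five fold-wise accuracies instead of a single averaged number
scores = cross_val_score(clf, X, y, scoring='accuracy', cv=5)

# report mean ± one standard deviation across the folds
print(f"accuracy: {scores.mean():.4f} ± {scores.std():.4f}")
```

If the two mean scores differ by much less than one fold-wise standard deviation, as in our case, the difference is within the noise of the cross-validation estimate.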

In [ ]:
#fit the final model on the whole dataset and predict on the test features
test_df_mod = test_df[most_important_features].copy()
df_mod = df[most_important_features].copy()
final_classifier.fit(df_mod, df_label)
y_pred_final = final_classifier.predict(test_df_mod)
y_pred_final
Out[ ]:
array([2, 0, 0, ..., 0, 1, 0], dtype=int64)
In [ ]:
file_path = "./output_test.txt"
#one integer prediction per line (savetxt writes a 1D array row by row)
np.savetxt(file_path, y_pred_final, fmt="%d")
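As a final sanity check, we can verify that the predictions round-trip through the text file; a self-contained sketch where `y_demo` and the temporary path stand in for `y_pred_final` and `./output_test.txt`:

```python
import os
import tempfile

import numpy as np

# small stand-in for the real prediction array
y_demo = np.array([2, 0, 1, 3, 0])

# write one integer label per line, then read the file back
path = os.path.join(tempfile.gettempdir(), "output_demo.txt")
np.savetxt(path, y_demo, fmt="%d")
y_check = np.loadtxt(path, dtype=int)

# the labels read back must match what was written
assert (y_check == y_demo).all()
print("predictions round-trip correctly:", y_check.tolist())
```

Running the same check on `output_test.txt` would confirm that the submission file contains exactly the 1300 predicted labels.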